Tips for using substr_count() with multibyte character encodings

gitbox 2025-06-03

In PHP, substr_count() is widely used to count how many times a substring appears in a string. It is fast and reliable for ASCII text, but because it works on bytes rather than characters, it can produce surprising results with multibyte character encodings. This article covers practical techniques for counting substrings in multibyte text such as UTF-8, with concrete examples to help you avoid common pitfalls.

1. Review of basic usage

The basic syntax of substr_count() is as follows:

<code> int substr_count ( string $haystack , string $needle [, int $offset = 0 [, int $length ]] ) </code>

This function returns the number of non-overlapping occurrences of $needle in $haystack. Note that it compares raw bytes and has no notion of character boundaries; $offset and $length are likewise measured in bytes.
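As a quick illustration of the signature above, the following sketch shows non-overlapping counting and the byte-based $offset and $length arguments. The sample strings are made up for this example.

```php
<?php
// Sample text invented for illustration.
$haystack = "hello world, hello php";

// Plain count of a substring.
$total = substr_count($haystack, "hello");          // 2

// Start searching at byte offset 1: the first "hello" is skipped.
$afterOffset = substr_count($haystack, "hello", 1); // 1

// Search only the first 11 bytes ("hello world").
$limited = substr_count($haystack, "o", 0, 11);     // 2

// Matches do not overlap: "aa" fits into "aaaa" twice, not three times.
$nonOverlapping = substr_count("aaaa", "aa");       // 2

echo "$total $afterOffset $limited $nonOverlapping\n";
```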

2. Problems caused by multi-byte characters

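For example, suppose you count how often the Chinese character "你" ("you") appears in a string:

<code> $str = "你好，你真的很好"; echo substr_count($str, "你"); // 2 when both strings are well-formed UTF-8 </code>

In UTF-8, "你" occupies three bytes, and substr_count() compares bytes without recognizing character boundaries. When the haystack and the needle are both well-formed UTF-8, byte-wise matching still gives the right count, because a complete UTF-8 sequence cannot begin in the middle of another character. The results go wrong when the two strings use different encodings, when $offset or $length (both counted in bytes) cut through a character, or when the text is in a legacy encoding such as Shift_JIS or GBK, where the trailing byte of one character can coincide with a different character. In those cases the count can include false matches or miss real ones.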
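To make the byte-level behavior concrete, here is a small sketch using the character "你". The sample string is chosen for illustration, and the code assumes the source file is saved as UTF-8 and the mbstring extension is available. It shows that the character is three bytes long and that $offset counts bytes, not characters.

```php
<?php
// Assumes this source file is saved as UTF-8 and mbstring is enabled.
$str = "你好，你真的很好";

$bytes = strlen("你");              // 3: length in bytes
$hex   = bin2hex("你");             // "e4bda0"
$chars = mb_strlen($str, "UTF-8");  // 8 characters

// With well-formed UTF-8 on both sides, the byte-wise count is still right.
$all = substr_count($str, "你");    // 2

// $offset is in bytes: skipping the first character takes 3 bytes, not 1.
$skipFirstChar = substr_count($str, "你", 3); // 1

echo "$bytes $hex $chars $all $skipFirstChar\n";
```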

3. Using mbstring functions instead of substr_count()

The mbstring extension provides mb_substr_count(), an encoding-aware counterpart of substr_count(), so in most cases it is the direct replacement. Similar results can also be obtained by combining other mb_* functions.

For example, you can use mb_split() to split the string on the needle and count the pieces:

<code> mb_regex_encoding("UTF-8"); $str = "你好，你真的很好"; $arr = mb_split("你", $str); $count = count($arr) - 1; echo $count; // Outputs 2 </code>

Because mb_split() treats its first argument as a regular expression, any special characters in the needle must be escaped. The split respects character boundaries, which avoids byte-level mismatches and makes the approach suitable for multibyte encodings.
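The mbstring extension also includes mb_substr_count(), which accepts an explicit encoding. The sketch below, assuming mbstring is enabled, counts "你" in UTF-8 text and then contrasts the two functions on a Shift_JIS string, where the second byte of "表" (0x95 0x5C) happens to equal an ASCII backslash. The sample data is chosen for illustration.

```php
<?php
// Assumes the mbstring extension is enabled and this file is saved as UTF-8.
$str = "你好，你真的很好";
$utf8Count = mb_substr_count($str, "你", "UTF-8"); // 2

// "表" encoded in Shift_JIS is the two bytes 0x95 0x5C.
$sjis = "\x95\x5C";

// Byte-wise search sees the trailing 0x5C as a backslash: a false positive.
$byteCount = substr_count($sjis, "\\");            // 1

// The encoding-aware search treats 0x95 0x5C as a single character.
$charCount = mb_substr_count($sjis, "\\", "SJIS"); // 0

echo "$utf8Count $byteCount $charCount\n";
```

Passing the encoding explicitly avoids depending on the configured internal encoding.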

4. UTF-8-aware regular expressions

Another common approach is preg_match_all() with the /u (UTF-8) modifier:

<code> $str = "你好，你真的很好"; preg_match_all('/你/u', $str, $matches); echo count($matches[0]); // Outputs 2 </code>

The /u modifier tells the regex engine to treat both the pattern and the subject as UTF-8, so "你" is matched as a single character rather than as three separate bytes.
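When the needle comes from user input, it may contain regex metacharacters. The following is a minimal sketch of a wrapper around preg_match_all(); the function name count_utf8_occurrences() is hypothetical, and the code assumes PHP 8.0 or later for preg_last_error_msg().

```php
<?php
// Hypothetical helper (not part of PHP): counts literal occurrences of
// $needle in $haystack, treating both as UTF-8.
function count_utf8_occurrences(string $haystack, string $needle): int
{
    if ($needle === '') {
        throw new InvalidArgumentException('Needle must not be empty.');
    }
    // preg_quote() escapes regex metacharacters so the needle is matched literally.
    $pattern = '/' . preg_quote($needle, '/') . '/u';
    $count = preg_match_all($pattern, $haystack);
    if ($count === false) {
        // With /u, malformed UTF-8 in the subject makes the match fail.
        throw new RuntimeException('Matching failed: ' . preg_last_error_msg());
    }
    return $count;
}

echo count_utf8_occurrences("你好，你真的很好", "你"), "\n"; // 2
echo count_utf8_occurrences("a.b.c", "."), "\n";             // 2
```

Throwing on a false return keeps malformed input from being silently reported as zero matches.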

5. Counting occurrences of specific paths or parameters in a URL

If the string contains a URL whose path or parameters include Chinese text, normalize it before matching: either decode the percent-encoded URL with urldecode() (or rawurldecode()), or encode the search term with rawurlencode(), so that both sides use the same representation. For example:

<code> $url = "https://gitbox.net/Hello/Hello.html"; $decoded = urldecode($url); preg_match_all('/Hello/u', $decoded, $matches); echo count($matches[0]); // Output 2 </code>

This keeps percent-encoded Chinese text from interfering with the match and keeps the count accurate.
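To show why decoding matters, the sketch below uses a hypothetical URL whose path contains "你好" in percent-encoded UTF-8, once with uppercase and once with lowercase hex digits. It uses rawurldecode(), which, unlike urldecode(), does not turn "+" into a space and is therefore a reasonable choice for path segments.

```php
<?php
// Hypothetical URL: "你好" is percent-encoded as UTF-8 in both path segments.
$url = "https://gitbox.net/%E4%BD%A0%E5%A5%BD/%e4%bd%a0%e5%a5%bd.html";

// Searching the raw URL for the literal text finds nothing.
$rawCount = preg_match_all('/你好/u', $url);          // 0

// Decoding restores the UTF-8 bytes regardless of hex-digit case.
$decoded = rawurldecode($url);
$decodedCount = preg_match_all('/你好/u', $decoded);  // 2

echo "$rawCount $decodedCount\n";
```

Decoding the URL is usually more robust than encoding the search term, because percent-escapes may appear in either upper or lower case.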

6. Summary

substr_count() works on bytes and is not encoding-aware, but the following techniques compensate for that:

  • Use mb_substr_count(), or split with mb_split() and count the pieces

  • Use preg_match_all() with the /u modifier for UTF-8-aware matching

  • Decode percent-encoded URLs with urldecode() before matching

  • Avoid relying on substr_count() alone for frequency analysis of multibyte text such as Chinese, Japanese, and Korean, particularly in legacy encodings or when byte offsets are involved

Mastering these techniques can noticeably improve the accuracy and stability of your code when building multilingual websites, processing natural language, or handling UTF-8 data from platforms such as gitbox.net.