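In PHP, the substr_count() function is widely used to count how many times a substring appears in a string. It compares raw bytes, which is fine for ASCII but can produce unexpected results with multi-byte text: its offset and length arguments are byte positions rather than character positions, and in encodings such as GBK or Shift_JIS a byte-level match can begin in the middle of a character. Well-formed UTF-8 is self-synchronizing, so a complete UTF-8 needle is not matched across character boundaries, but mixed encodings, invalid input, and byte-based arguments still cause trouble. This article covers practical techniques for counting substrings in multi-byte text and uses specific examples to help you avoid these pitfalls.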
The basic syntax of substr_count() is as follows:
<code> substr_count(string $haystack, string $needle, int $offset = 0, ?int $length = null): int </code> This function returns the number of non-overlapping, case-sensitive occurrences of $needle in $haystack. Note that it processes strings byte by byte and does not recognize character boundaries, so $offset and $length are also measured in bytes.
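To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes the source file is saved as UTF-8 and the mbstring extension is loaded; the sample strings are illustrative.

```php
<?php
// Assumption: this file is UTF-8 encoded and mbstring is available.
$str = "你好，世界";

$byteLength = strlen($str);             // counts bytes
$charLength = mb_strlen($str, 'UTF-8'); // counts characters

echo $byteLength, "\n"; // 15: five characters, three bytes each
echo $charLength, "\n"; // 5

// $offset is a byte offset: 3 skips exactly the first character "你".
$afterFirst = substr_count("你好，你真好", "你", 3);
echo $afterFirst, "\n"; // 1
```

An offset that is not a multiple of the character width would start the search in the middle of a character, which is why byte offsets should be computed with strlen() or strpos() rather than with their mb_ counterparts.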
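For example, if you count how many times the Chinese character "你" appears in a string, the result depends on the encoding of the data:
<code> $str = "你好，你真的很好"; echo substr_count($str, "你"); // 2 when both strings are valid UTF-8; may be wrong in other encodings</code> The character "你" occupies three bytes in UTF-8, and substr_count() compares raw bytes without knowing where characters begin or end. When the haystack and needle are both well-formed UTF-8, the byte comparison still lands on character boundaries. When they use different encodings, when the data is in an encoding such as GBK or Shift_JIS where a trailing byte can look like a leading byte, or when a byte-based offset or length cuts a character in half, the count can include false matches or miss real ones.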
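The mbstring extension provides mb_substr_count(), which counts occurrences by character in a specified encoding and is the most direct replacement. When you need pattern matching rather than a literal needle, mb_split() and preg_match_all() are useful alternatives.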
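For reference, the mbstring extension does ship a dedicated mb_substr_count() function. The following is a minimal sketch of its use, assuming mbstring is loaded and the source file is UTF-8; the sample string is illustrative.

```php
<?php
// Assumption: mbstring is loaded and this file is saved as UTF-8.
$str = "你好，你真的很好";

// The third argument names the encoding explicitly instead of relying on
// mb_internal_encoding().
$youCount  = mb_substr_count($str, "你", 'UTF-8');
$goodCount = mb_substr_count($str, "好", 'UTF-8');

echo $youCount, "\n";  // 2
echo $goodCount, "\n"; // 2
```

Passing the encoding explicitly keeps the result independent of server configuration. Note that an empty needle is rejected, so validate user-supplied needles first.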
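For example, you can use mb_split() to split the string on the needle and derive the count from the number of pieces:
<code> $str = "你好，你真的很好"; $arr = mb_split("你", $str); $count = count($arr) - 1; echo $count; // Output 2 </code> This avoids byte-level misjudgment and works with multi-byte encodings. Keep in mind that mb_split() treats its first argument as a regular expression and uses the encoding set by mb_regex_encoding(), so needles containing regex metacharacters need special handling.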
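Because mb_split() interprets the needle as a regular expression, a needle such as "1+1" does not behave literally. The sketch below contrasts that with a literal, character-aware counter built on mb_strpos(). The helper name count_literal_mb is hypothetical, and the code assumes mbstring is loaded.

```php
<?php
// Hypothetical helper (not a PHP built-in): counts non-overlapping literal
// occurrences of $needle, stepping through the string by characters.
function count_literal_mb(string $haystack, string $needle, string $encoding = 'UTF-8'): int
{
    if ($needle === '') {
        return 0; // nothing sensible to count
    }
    $count  = 0;
    $offset = 0;
    $step   = mb_strlen($needle, $encoding);
    while (($pos = mb_strpos($haystack, $needle, $offset, $encoding)) !== false) {
        $count++;
        $offset = $pos + $step; // continue after the match just found
    }
    return $count;
}

// As a regex, "1+1" means "one or more 1s followed by 1", which never
// occurs in "1+1=2", so mb_split() finds nothing to split on.
$viaSplit   = count(mb_split("1+1", "1+1=2")) - 1;
$viaLiteral = count_literal_mb("1+1=2", "1+1");

echo $viaSplit, "\n";   // 0
echo $viaLiteral, "\n"; // 1
echo count_literal_mb("你好，你真的很好", "你"), "\n"; // 2
```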
Another common approach is to use preg_match_all() with the UTF-8 modifier:
<code> $str = "你好，你真的很好"; preg_match_all('/你/u', $str, $matches); echo count($matches[0]); // Output 2 </code> The /u modifier tells the regex engine to treat both the pattern and the subject as UTF-8, which ensures that "你" is matched as a single character. If the subject is not valid UTF-8, preg_match_all() returns false, so check the return value before using the matches.
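When the needle comes from user input, it should be escaped before it is placed in a pattern. The following is a minimal sketch of a literal counter built on preg_match_all(); the helper name count_literal_utf8 is hypothetical, and throwing an exception on failure is a design choice for illustration only.

```php
<?php
// Hypothetical helper (not a PHP built-in): counts literal occurrences of
// $needle in a UTF-8 string using a /u regular expression.
function count_literal_utf8(string $haystack, string $needle): int
{
    if ($needle === '') {
        return 0;
    }
    // preg_quote() escapes regex metacharacters and the "/" delimiter.
    $pattern = '/' . preg_quote($needle, '/') . '/u';
    $result  = preg_match_all($pattern, $haystack);
    if ($result === false) {
        // Typically reached when $haystack is not valid UTF-8.
        throw new InvalidArgumentException('preg error code: ' . preg_last_error());
    }
    return $result;
}

echo count_literal_utf8("你好，你真的很好", "你"), "\n"; // 2
echo count_literal_utf8("1+1=2, 1+1=2", "1+1"), "\n";   // 2
```

Checking for false matters because a failed match would otherwise be reported as zero occurrences.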
If you are processing a string that contains a URL, and the URL has Chinese path segments or parameters, bring the URL and the needle into the same form before matching, for example by decoding the URL with rawurldecode() or urldecode(). For example:
<code> $url = "https://gitbox.net/%E4%BD%A0%E5%A5%BD/%E4%BD%A0%E5%A5%BD.html"; $decoded = urldecode($url); preg_match_all('/你好/u', $decoded, $matches); echo count($matches[0]); // Output 2 </code> Decoding first prevents percent-encoded Chinese text from hiding matches and keeps the count accurate.
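The decode-then-count step can be sketched as a small helper that looks only at the path component. The helper name count_in_url_path and the percent-encoded URL are illustrative assumptions; the code relies on parse_url(), rawurldecode(), and mb_substr_count(), and requires mbstring.

```php
<?php
// Hypothetical helper (not a PHP built-in): counts $needle in the decoded
// path of a URL.
function count_in_url_path(string $url, string $needle): int
{
    if ($needle === '') {
        return 0;
    }
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return 0; // malformed URL or no path component
    }
    // rawurldecode() leaves "+" untouched, which suits path segments;
    // urldecode() would turn "+" into a space.
    $decoded = rawurldecode($path);
    return mb_substr_count($decoded, $needle, 'UTF-8');
}

// Illustrative URL whose path decodes to "/你好/你好.html".
$url = "https://gitbox.net/%E4%BD%A0%E5%A5%BD/%E4%BD%A0%E5%A5%BD.html";
$pathCount = count_in_url_path($url, "你好");
echo $pathCount, "\n"; // 2
```

Restricting the search to the path avoids accidental matches in the host name or query string; drop the parse_url() step if the whole URL should be searched.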
substr_count() works on bytes and is not aware of character encodings, but the following techniques compensate for that:
Use mb_substr_count() with an explicit encoding for literal needles
Use mb_split() to split and count
Use preg_match_all() with the /u modifier for UTF-8 patterns
Decode URLs with urldecode() or rawurldecode() before matching
Avoid using substr_count() directly for frequency analysis of Chinese, Japanese, or Korean text unless you know that both strings are valid and share the same encoding
Applying these techniques makes substring counts more accurate and more predictable when you build multilingual websites, process natural-language text, or handle UTF-8 data from platforms such as gitbox.net.