What are some practical tips for using substr_count() with multibyte character encodings?

gitbox 2025-06-03

In PHP, substr_count() is widely used to count how many times a substring appears in a string. It is efficient and works on raw bytes, which is fine for ASCII text but can produce unexpected results with multibyte encodings such as UTF-8, particularly when offsets, lengths, or mixed encodings are involved. This article introduces practical techniques for using substr_count() with multibyte text and uses specific examples to help you avoid common pitfalls.

1. Review of basic usage

The basic syntax of substr_count() is as follows:

<code> int substr_count ( string $haystack , string $needle [, int $offset = 0 [, int $length ]] ) </code>

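This function returns the number of times $needle appears in $haystack. The search is case-sensitive and does not count overlapping occurrences. Note that the function works on bytes and does not recognize character boundaries, so $offset and $length are also measured in bytes rather than characters.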
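To make the byte-based behavior concrete, here is a minimal sketch. The sample string is only an illustration, and the code assumes the file is saved as UTF-8 and the mbstring extension is loaded.

```php
<?php
// Illustrative string; assumes this file is saved as UTF-8 and mbstring is loaded.
$str = "你好，你真棒";

$bytes = strlen($str);              // 18: each of the 6 characters takes 3 bytes
$chars = mb_strlen($str, 'UTF-8');  // 6

// Intended as "search the first 4 characters", but 4 is read as a byte count,
// so the window covers "你" plus one stray byte of "好".
$byteWindow = substr_count($str, "你", 0, 4);                        // 1

// Cutting by characters first gives the window that was actually meant.
$charWindow = substr_count(mb_substr($str, 0, 4, 'UTF-8'), "你");    // 2

echo "$bytes bytes, $chars characters\n";
echo "byte window: $byteWindow, character window: $charWindow\n";
```

The two counts differ only because the third and fourth arguments of substr_count() are byte positions. Slicing with mb_substr() first is one way to work in characters.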

2. Problems caused by multi-byte characters

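For example, suppose you count how many times the Chinese character "你" appears in a string:

<code> $str = "你好，你真棒"; echo substr_count($str, "你"); // 2 if both strings are valid UTF-8; other encodings may differ </code>

"你" occupies three bytes in UTF-8 (E4 BD A0), and substr_count() compares bytes without recognizing character boundaries. When the haystack and the needle are both valid UTF-8, this still gives the right count, because the byte sequence of one UTF-8 character cannot begin in the middle of another. Miscounts or missed matches appear when the two strings use different encodings, when a byte-based $offset or $length cuts through a character, or when the text is in an encoding such as Shift_JIS, GBK, or Big5, where the trailing byte of one character can coincide with a byte of another.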
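One case where byte-wise matching does go wrong is Shift_JIS. The sketch below uses the katakana "ソ", whose Shift_JIS encoding ends in the byte 0x5C, the same byte as an ASCII backslash. The variable names are illustrative, and the mbstring extension is assumed to be available.

```php
<?php
// "ソ" encoded as Shift_JIS is the two bytes 0x83 0x5C.
// 0x5C on its own is the ASCII backslash.
$sjis = "\x83\x5C";

// Byte-wise search reports a backslash that is really half of a character.
$byteCount = substr_count($sjis, "\\");              // 1 (false positive)

// Encoding-aware search, told explicitly that the data is Shift_JIS.
$charCount = mb_substr_count($sjis, "\\", 'SJIS');   // 0

// The same character in UTF-8 is E3 82 BD and contains no 0x5C byte,
// so a byte-wise search no longer misfires after conversion.
$utf8 = mb_convert_encoding($sjis, 'UTF-8', 'SJIS');
$utf8Count = substr_count($utf8, "\\");              // 0

echo "bytes: $byteCount, SJIS-aware: $charCount, after UTF-8 conversion: $utf8Count\n";
```

Converting legacy-encoded input to UTF-8 at the boundary of the program is a common way to sidestep this class of false match.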

3. Using mbstring functions instead of substr_count()

PHP's mbstring extension provides mb_substr_count(), which counts occurrences of a substring with awareness of the string's encoding. Similar results can also be achieved by combining other multibyte functions.

For example, you can use mb_split() to split the string on the substring and count the pieces:

<code> $str = "你好，你真棒"; $arr = mb_split("你", $str); $count = count($arr) - 1; echo $count; // Output 2 </code>

This splits on character boundaries in the encoding set by mb_regex_encoding(), so it is suitable for multibyte encodings. Note that mb_split() treats its first argument as a regular expression pattern, so a substring containing characters such as "." or "*" must be escaped first.
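The mbstring extension also ships mb_substr_count(), which takes the encoding as an optional third argument. The wrapper below is a hypothetical helper (count_utf8_substring is not a PHP built-in) sketched under the assumption that all input should be valid UTF-8.

```php
<?php
// Hypothetical helper, not part of PHP: counts $needle in $haystack,
// treating both as UTF-8. Requires the mbstring extension.
function count_utf8_substring(string $haystack, string $needle): int
{
    // mb_substr_count() rejects an empty needle, so handle it up front.
    if ($needle === '') {
        return 0;
    }

    // Refuse input that is not valid UTF-8 instead of counting garbage.
    if (!mb_check_encoding($haystack, 'UTF-8') || !mb_check_encoding($needle, 'UTF-8')) {
        throw new InvalidArgumentException('Both strings must be valid UTF-8.');
    }

    // Name the encoding explicitly rather than relying on mb_internal_encoding().
    return mb_substr_count($haystack, $needle, 'UTF-8');
}

echo count_utf8_substring("你好，你真棒", "你"), "\n"; // 2
```

Passing the encoding explicitly keeps the result independent of the server's mbstring configuration.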

4. Regular expressions in UTF-8 mode

Another common approach is to use preg_match_all() with the UTF-8 modifier:

<code> $str = "你好，你真棒"; preg_match_all('/你/u', $str, $matches); echo count($matches[0]); // Output 2 </code>

The /u modifier tells the regex engine to treat both the pattern and the subject as UTF-8, ensuring that "你" is matched as a whole character. preg_match_all() also returns the number of full matches directly, so the return value can be used instead of counting $matches[0].
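When the substring comes from user input, it may contain regex metacharacters. The following sketch wraps the same idea in a hypothetical helper (count_matches_utf8 is an illustrative name, not a PHP built-in) that escapes the substring with preg_quote() and uses the return value of preg_match_all().

```php
<?php
// Hypothetical helper, not part of PHP: counts literal occurrences of
// $needle in $haystack using PCRE in UTF-8 mode.
function count_matches_utf8(string $haystack, string $needle): int
{
    if ($needle === '') {
        return 0;
    }

    // Escape metacharacters (and the "/" delimiter) so the needle is literal.
    $pattern = '/' . preg_quote($needle, '/') . '/u';

    // Without a third argument, preg_match_all() just returns the match count,
    // or false on failure (for example, a subject that is not valid UTF-8).
    $result = preg_match_all($pattern, $haystack);
    if ($result === false) {
        throw new RuntimeException('preg_match_all() failed, error code ' . preg_last_error());
    }

    return $result;
}

echo count_matches_utf8("你好，你真棒", "你"), "\n"; // 2
echo count_matches_utf8("a.b.c", "."), "\n";          // 2
```

Without preg_quote(), the "." in the second call would match every character and the count would be 5.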

5. Counting occurrences of specific paths or parameters in a URL

If you are processing a string that contains a URL with Chinese paths or parameters, bring both sides into the same form before matching: either decode the URL with urldecode() (or rawurldecode() for path segments), or encode the substring with rawurlencode(). For example:

<code> $url = "https://gitbox.net/%E4%BD%A0%E5%A5%BD/%E4%BD%A0%E5%A5%BD.html"; $decoded = urldecode($url); preg_match_all('/你好/u', $decoded, $matches); echo count($matches[0]); // Output 2 </code>

This prevents the percent-encoded form of the Chinese text from hiding matches and keeps the count accurate.
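The same idea can be narrowed to the path component only. The helper below is a hypothetical sketch (count_in_url_path is an illustrative name), and the percent-encoded URL is an assumed example. It uses parse_url() to isolate the path, rawurldecode() to decode it, and mb_substr_count() to count.

```php
<?php
// Hypothetical helper, not part of PHP: counts $needle in the decoded
// path of $url. Requires the mbstring extension.
function count_in_url_path(string $url, string $needle): int
{
    // parse_url() returns null when there is no path and false on a malformed URL.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $needle === '') {
        return 0;
    }

    // rawurldecode() leaves "+" untouched, which is the right behavior for paths.
    $decoded = rawurldecode($path);

    return mb_substr_count($decoded, $needle, 'UTF-8');
}

// Assumed example URL: the path is "/你好/你好.html" in percent-encoded form.
$url = "https://gitbox.net/%E4%BD%A0%E5%A5%BD/%E4%BD%A0%E5%A5%BD.html";
echo count_in_url_path($url, "你好"), "\n"; // 2
```

Restricting the search to the path avoids accidental matches in the host name or query string.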

6. Summary

substr_count() works on bytes and is not aware of multibyte character encodings, but this can be compensated for with the following techniques:

Use mb_substr_count() when the mbstring extension is available
Use mb_split() to split and count
Use preg_match_all() with the /u modifier
Decode URLs with urldecode() before matching
Avoid using substr_count() directly for frequency analysis of multibyte text such as Chinese, Japanese, and Korean unless the input is known to be valid UTF-8 and no byte offsets are involved

Mastering these techniques can greatly improve the accuracy and stability of your programs when developing multilingual websites, processing natural language, or handling UTF-8 data from platforms such as gitbox.net.