The basic syntax of mb_strpos() is as follows:
```php
mb_strpos(string $haystack, string $needle, int $offset = 0, ?string $encoding = null): int|false
```
$haystack: The target string.
$needle: The substring to search for.
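$offset: The position at which the search starts, counted in characters. The default is 0; a negative value counts from the end of the string.
$encoding: The character encoding. If omitted or null, the internal encoding reported by mb_internal_encoding() is used.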
Unlike strpos(), which works on bytes, mb_strpos() is multibyte-safe and reports positions in characters. This matters when handling strings in multibyte encodings such as UTF-8, GBK, or BIG5.
The main issue is that mb_strpos() may return different match positions for what looks like the same text when the encodings involved differ. The same string and search character can yield different position indices, or no match at all, depending on which encoding the bytes are actually in and which encoding the function assumes. Why does this happen?
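To make the difference concrete, here is a minimal sketch comparing the two functions on the same UTF-8 string. The sample text and variable names are illustrative assumptions. The string is written with `\u{...}` escapes (PHP 7.0+), so the result does not depend on how the source file is saved, and the mbstring extension is assumed to be loaded.

```php
<?php
// "你好世界" written as Unicode escapes; PHP emits these as UTF-8 bytes.
$haystack = "\u{4F60}\u{597D}\u{4E16}\u{754C}";
$needle   = "\u{4E16}"; // "世"

// strpos() counts bytes: the two preceding characters occupy 3 bytes each.
$bytePos = strpos($haystack, $needle);

// mb_strpos() counts characters when told the string is UTF-8.
$charPos = mb_strpos($haystack, $needle, 0, 'UTF-8');

var_dump($bytePos); // int(6)
var_dump($charPos); // int(2)
```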
Character Encoding and Byte Length:
Character encoding determines how many bytes a character occupies in memory. UTF-8 is a variable-length encoding, with characters taking 1 to 4 bytes, while GBK uses 1 byte for ASCII characters and 2 bytes for everything else. mb_strpos() returns a position counted in characters, but to find the character boundaries it has to decode the bytes according to the encoding it is given, so the result is only correct when that encoding matches the actual bytes.
Handling Multibyte Characters:
When processing multibyte characters, mb_strpos() groups bytes into characters according to the rules of the encoding. For instance, a Chinese character like "你" takes 3 bytes in UTF-8 but only 2 bytes in GBK. When the encoding is declared correctly, the character position is the same in both cases; when it is not, the bytes are grouped incorrectly and the reported position shifts.
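The following sketch illustrates the byte-length point under a few assumptions: the mbstring extension is loaded, it accepts 'GBK' as an encoding name, and the sample text is purely illustrative. It stores the same four characters in UTF-8 and in GBK, then searches each with its own encoding.

```php
<?php
$utf8 = "\u{4F60}\u{597D}\u{4E16}\u{754C}";          // "你好世界" as UTF-8
$gbk  = mb_convert_encoding($utf8, 'GBK', 'UTF-8');  // same text as GBK

// The byte lengths differ between the two encodings...
$utf8Bytes = strlen($utf8); // 3 bytes per character
$gbkBytes  = strlen($gbk);  // 2 bytes per character

// ...but each search is told the encoding its bytes are actually in.
$needleUtf8 = "\u{4E16}"; // "世"
$needleGbk  = mb_convert_encoding($needleUtf8, 'GBK', 'UTF-8');

$posUtf8 = mb_strpos($utf8, $needleUtf8, 0, 'UTF-8');
$posGbk  = mb_strpos($gbk, $needleGbk, 0, 'GBK');

var_dump($utf8Bytes, $gbkBytes); // int(12) int(8)
var_dump($posUtf8, $posGbk);     // int(2) int(2)
```

The byte counts differ, yet the character position is identical, because each call decodes its input with the matching encoding.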
Impact of Encoding Mismatch:
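If mb_strpos() falls back to the default encoding, or the string and the search character are in different encodings, it may return inaccurate results. The same character has a different byte representation in each encoding, so the search can miss the match entirely or report a miscalculated position.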
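As a rough illustration of such a mismatch, the sketch below searches a GBK haystack with a UTF-8 needle, then repeats the search after converting the needle. The variable names and sample text are assumptions made for the example.

```php
<?php
$haystackGbk = mb_convert_encoding("\u{4F60}\u{597D}\u{4E16}\u{754C}", 'GBK', 'UTF-8');
$needleUtf8  = "\u{4E16}"; // "世" as UTF-8 bytes e4 b8 96

// The GBK haystack does not contain the UTF-8 byte sequence, so nothing matches.
$mismatched = mb_strpos($haystackGbk, $needleUtf8, 0, 'GBK');

// After converting the needle to the haystack's encoding, the match is found.
$needleGbk = mb_convert_encoding($needleUtf8, 'GBK', 'UTF-8');
$matched   = mb_strpos($haystackGbk, $needleGbk, 0, 'GBK');

var_dump($mismatched); // bool(false)
var_dump($matched);    // int(2)
```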
Ensure consistent encoding for both string and search character:
Use mb_internal_encoding() to check the default encoding of your PHP script, and ensure that both the target string and search character share the same encoding. You can convert encodings with mb_convert_encoding(), for example:
```php
$haystack = mb_convert_encoding($haystack, 'UTF-8', 'auto');
$needle = mb_convert_encoding($needle, 'UTF-8', 'auto');
```
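This converts both strings to UTF-8. Note that 'auto' relies on mbstring's encoding detection, which is a heuristic and can guess wrong; when the source encoding is known, pass it explicitly (for example 'GBK') instead.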
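When the source encoding is known, the conversion can be wrapped in a small helper. The function name `toUtf8()` and its validation step are hypothetical, shown only as one possible sketch and not as part of mbstring.

```php
<?php
/**
 * Hypothetical helper: convert $text from a known source encoding to UTF-8.
 * Rejects input that is not valid in the declared source encoding.
 */
function toUtf8(string $text, string $sourceEncoding): string
{
    if (!mb_check_encoding($text, $sourceEncoding)) {
        throw new InvalidArgumentException("Input is not valid {$sourceEncoding}");
    }
    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}

// Example: simulate GBK input, then normalize it before searching.
$gbkInput = mb_convert_encoding("\u{4F60}\u{597D}\u{4E16}\u{754C}", 'GBK', 'UTF-8');
$haystack = toUtf8($gbkInput, 'GBK');
$position = mb_strpos($haystack, "\u{4E16}", 0, 'UTF-8');

var_dump($position); // int(2)
```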
Explicitly specify encoding:
When calling mb_strpos(), always specify the encoding. Even if the default encodings differ, specifying it prevents inconsistent results, for example:
```php
$position = mb_strpos($haystack, $needle, 0, 'UTF-8');
```
This ensures both strings are interpreted as UTF-8 during the search, regardless of the internal encoding setting.
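To see why the explicit argument matters, the sketch below deliberately switches the internal encoding to a single-byte one and compares the two calls. Using ISO-8859-1 as the mismatched default is an assumption chosen only to make the effect easy to reproduce.

```php
<?php
$haystack = "\u{4F60}\u{597D}\u{4E16}\u{754C}"; // "你好世界" as UTF-8
$needle   = "\u{4E16}";                          // "世"

$previous = mb_internal_encoding();  // remember the current default
mb_internal_encoding('ISO-8859-1');  // simulate a mismatched default

// Without $encoding, every byte is treated as one ISO-8859-1 character.
$implicit = mb_strpos($haystack, $needle);

// With $encoding, the default no longer matters.
$explicit = mb_strpos($haystack, $needle, 0, 'UTF-8');

mb_internal_encoding($previous);     // restore the original setting

var_dump($implicit); // int(6)
var_dump($explicit); // int(2)
```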
Check encoding validity:
When handling user input or external strings, always verify encoding validity. Use mb_check_encoding() to confirm that the string is valid in the encoding you expect:
```php
if (mb_check_encoding($haystack, 'UTF-8') && mb_check_encoding($needle, 'UTF-8')) {
    $position = mb_strpos($haystack, $needle, 0, 'UTF-8');
} else {
    // Invalid input: treat as "not found" rather than leaving $position undefined
    $position = false;
}
```
This avoids searching strings that contain invalid byte sequences, which could otherwise produce misleading positions.
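The check and the search can also be combined into a single function. The name `safeMbStrpos()` and the choice to throw an exception are assumptions for this sketch. The `int|false` return type requires PHP 8.0 or later.

```php
<?php
/**
 * Hypothetical helper: validate both strings before searching.
 * Throws instead of returning a position computed from invalid bytes.
 */
function safeMbStrpos(string $haystack, string $needle, string $encoding = 'UTF-8'): int|false
{
    if (!mb_check_encoding($haystack, $encoding) || !mb_check_encoding($needle, $encoding)) {
        throw new InvalidArgumentException("Input is not valid {$encoding}");
    }
    return mb_strpos($haystack, $needle, 0, $encoding);
}

$valid = safeMbStrpos("\u{4F60}\u{597D}\u{4E16}\u{754C}", "\u{4E16}");

try {
    safeMbStrpos("\xFF\xFE", "\u{4E16}"); // the byte 0xFF never occurs in valid UTF-8
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}

var_dump($valid);    // int(2)
var_dump($rejected); // bool(true)
```

Throwing keeps "invalid input" distinct from "needle not found", which a plain false return value cannot express.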
Debug and test:
During development, test string searches under different encodings to ensure consistent mb_strpos() behavior. Tools like bin2hex() can help inspect the actual byte representation of characters:
```php
echo bin2hex($haystack), PHP_EOL;
echo bin2hex($needle), PHP_EOL;
```
This shows exactly which bytes each string contains, which makes it easy to spot a haystack and needle that are in different encodings.
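For example, dumping the same character in two encodings makes the difference visible at a glance. The sketch assumes mbstring is loaded and accepts 'GBK' as an encoding name.

```php
<?php
$utf8 = "\u{4E16}";                                  // "世" as UTF-8
$gbk  = mb_convert_encoding($utf8, 'GBK', 'UTF-8');  // "世" as GBK

$utf8Hex = bin2hex($utf8);
$gbkHex  = bin2hex($gbk);

echo $utf8Hex, PHP_EOL; // e4b896
echo $gbkHex, PHP_EOL;  // cac0
```

If the hex dump of a needle does not appear anywhere inside the hex dump of the haystack, the two strings are most likely in different encodings.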
mb_strpos() is a powerful function for searching multibyte strings and supports multiple character encodings, but results may vary when encodings are mixed or misdeclared. The primary reason is that the encoding determines how bytes are grouped into characters, so a wrong or mismatched encoding leads to wrong position calculations. The key to resolving this issue is ensuring consistent encoding for strings and search characters and explicitly specifying the encoding when calling the function. Additionally, encoding validation and thorough testing are crucial for stable code.
By properly managing and converting encodings, we can prevent inconsistent mb_strpos() positions in multibyte environments, making string searches more accurate and reliable.