mb_strcut is a practical PHP function for working with multibyte text such as Chinese. It cuts a string by byte count without splitting a multibyte character, which avoids the garbled output that a plain byte cut can produce. Even so, many developers run into a few common pitfalls when using mb_strcut. This article explains how to use the function correctly and how to resolve those problems.
Before digging in, let's clear up a common misunderstanding: although mb_strcut and mb_substr look similar, they behave very differently.
mb_substr works in characters: its start and length arguments count characters.
mb_strcut works in bytes: its start and length arguments count bytes. If the start offset falls inside a multibyte character, the start is moved back to the first byte of that character, and the result is the longest run of whole characters that fits within the requested length.
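This matters for Chinese text, where most characters take 3 bytes in UTF-8. If you calculate the byte offset or length inaccurately, mb_strcut still returns whole characters, but the result may start earlier or end shorter than you expected. Garbled output usually appears when the encoding argument does not match the actual encoding of the data, or when the bytes are cut by a function that is not multibyte-aware.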
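The difference between the two functions can be sketched with a short comparison. This is a minimal illustration, assuming the script file is saved as UTF-8; the sample string and variable names are made up for the example.

```php
<?php
// Illustrative sample: 4 Chinese characters, 12 bytes in UTF-8.
$text = "你好世界";

// mb_substr counts characters: 2 characters from character offset 0.
$byChars = mb_substr($text, 0, 2, "UTF-8");

// mb_strcut counts bytes: at most 7 bytes from byte offset 0.
// Only two whole characters (6 bytes) fit into that budget.
$byBytes = mb_strcut($text, 0, 7, "UTF-8");

// A 2-byte budget cannot hold even one 3-byte character.
$tooSmall = mb_strcut($text, 0, 2, "UTF-8");

echo $byChars, "\n";          // 你好
echo $byBytes, "\n";          // 你好
echo strlen($tooSmall), "\n"; // 0
```

The same numeric arguments therefore mean different things in the two functions, so it helps to decide up front whether a limit is expressed in characters or in bytes.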
Suppose we need to cut a string that contains Chinese by bytes, without corrupting any character:
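<?php
// Sample text; the fullwidth comma and full stop take 3 bytes each in UTF-8.
$str = "Welcome to visit gitbox.net,This is a Chinese string for demonstration。";
// Take at most 18 bytes from byte offset 0, ending on a character boundary.
$cutStr = mb_strcut($str, 0, 18, "UTF-8");
echo $cutStr;
?>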
The code above asks for the first 18 bytes. Note the following:
If the string contains Chinese (3 bytes per character in UTF-8), the 18-byte limit may fall in the middle of a character.
mb_strcut then backs up to the previous character boundary, so the result can be shorter than 18 bytes. Where those boundaries lie depends on the encoding you pass.
Therefore, always pass the correct encoding as the fourth parameter of mb_strcut, usually "UTF-8". If the parameter is omitted, the internal encoding (see mb_internal_encoding()) is used.
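To make the role of the encoding parameter concrete, here is a small sketch that runs the same cut twice, once with the right encoding and once with a deliberately wrong one. It assumes the script file is saved as UTF-8; the sample string is illustrative.

```php
<?php
// Illustrative sample: 5 Chinese characters, 15 bytes in UTF-8.
$text = "中文字符串";

// Correct encoding: the 7-byte limit falls inside the third character,
// so the cut backs up to the 6-byte boundary.
$ok = mb_strcut($text, 0, 7, "UTF-8");

// Wrong encoding: ISO-8859-1 treats every byte as one character,
// so exactly 7 bytes come back and the third character is split.
$broken = mb_strcut($text, 0, 7, "ISO-8859-1");

var_dump(strlen($ok));                         // int(6)
var_dump(mb_check_encoding($ok, "UTF-8"));     // bool(true)
var_dump(strlen($broken));                     // int(7)
var_dump(mb_check_encoding($broken, "UTF-8")); // bool(false)
```

The second result is the kind of string that shows up as garbled text in a UTF-8 page.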
Garbled output after cutting is the most common problem. The usual causes are:
The encoding is not set correctly.
The start offset or length splits a character, for example because the bytes were cut under the wrong encoding.
Solution:
Always use UTF-8 encoding and make sure that the output environment (such as HTML pages) is also UTF-8.
header("Content-Type: text/html; charset=utf-8");
Another common mistake is confusing character length with byte length. For example, if you want to display "10 characters" rather than "10 bytes", mb_strcut is the wrong tool because it counts bytes. Use mb_substr instead:
$cutStr = mb_substr($str, 0, 10, "UTF-8");
A third pitfall is a start offset that is not on a character boundary. When you cut from the middle of a string (for example, from byte offset 5), the offset may land inside a character. mb_strcut then moves the start back to the first byte of that character, so the result can begin earlier than you expected.
Suggestions:
Where possible, start from character boundaries rather than arbitrary byte offsets.
If you must work in bytes, try mb_strcut on sample input first and check the output.
To avoid repeating these mistakes, you can wrap the call in a helper function that safely cuts Chinese strings:
// Return at most $length bytes from the start of $string without splitting a character.
function safeCutStr($string, $length, $charset = "UTF-8") {
    return mb_strcut($string, 0, $length, $charset);
}
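A slightly more defensive variant of such a wrapper could validate its input before cutting. The sketch below is one possible shape, not the only one: cutBytes() is a hypothetical name, the handling of a non-positive limit is an assumption made for this example, and mb_scrub() requires PHP 7.2 or later.

```php
<?php
// Hypothetical helper: cut to a byte budget with basic input checks.
function cutBytes(string $text, int $maxBytes, string $charset = "UTF-8"): string
{
    // Assumption for this sketch: a non-positive budget means nothing fits.
    if ($maxBytes <= 0) {
        return "";
    }
    // Replace invalid byte sequences first so the cut works on
    // well-formed input.
    if (!mb_check_encoding($text, $charset)) {
        $text = mb_scrub($text, $charset);
    }
    return mb_strcut($text, 0, $maxBytes, $charset);
}

echo cutBytes("中文字符串", 10), "\n"; // 中文字
```

A helper like this is useful when the limit really is in bytes, such as a fixed-size storage field.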
Before the page is output, you can also add a post-processing step that checks whether the last character is complete and drops it if it is not.
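One way to sketch such a post-processing step for UTF-8 data is shown below. dropIncompleteTail() is a hypothetical helper for this example; it only repairs a string whose damage is a truncated final character and leaves anything else unchanged.

```php
<?php
// Hypothetical helper: remove an incomplete trailing UTF-8 character.
function dropIncompleteTail(string $bytes): string
{
    $candidate = $bytes;
    // A UTF-8 character is at most 4 bytes long, so an incomplete
    // trailing fragment is at most 3 bytes: try dropping 0 to 3 bytes.
    for ($i = 0; $i <= 3; $i++) {
        if (mb_check_encoding($candidate, "UTF-8")) {
            return $candidate;
        }
        $candidate = (string) substr($candidate, 0, -1);
    }
    // Still invalid: the damage is not at the tail, so return the input as is.
    return $bytes;
}

// A raw byte cut with substr() leaves 2 stray bytes of the third character.
$raw = substr("你好世界", 0, 8);
echo dropIncompleteTail($raw), "\n"; // 你好
```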
When working with multibyte text such as Chinese, mb_strcut is convenient for cutting by byte length, but you need to keep the relationship between bytes and characters in mind. To keep garbled output to a minimum:
Always specify the correct encoding (such as UTF-8);
Prefer mb_substr when you need to cut by characters;
If you must cut by bytes, wrap the call in a helper with fault-tolerant logic.
Used sensibly, mb_strcut makes your PHP programs more robust when handling Chinese text.