
Common problems when the mb_strcut function handles special characters

gitbox 2025-05-26

What is mb_strcut?

mb_strcut extracts a substring of a given byte length from a multibyte string. It is similar to mb_substr, but mb_strcut measures the start position and the length in bytes, not in characters.

 <?php
$str = "这是一个测试字符串"; // "This is a test string"
echo mb_strcut($str, 0, 6, "UTF-8"); // Output: "这是"
?>

Here 6 is a number of bytes. In UTF-8 a Chinese character usually occupies 3 bytes, so only the first two Chinese characters are returned.
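The difference between the two units is easiest to see by passing the same arguments to both functions. The following is a minimal sketch, assuming a UTF-8 source file and a sample string of nine CJK characters (3 bytes each); the variable names are illustrative only.

```php
<?php
// Sample string: nine CJK characters, 3 bytes each in UTF-8 (an illustrative assumption).
$str = "这是一个测试字符串";

$byteLength = strlen($str);              // strlen counts bytes
$charLength = mb_strlen($str, "UTF-8");  // mb_strlen counts characters

$byBytes = mb_strcut($str, 0, 6, "UTF-8"); // length 6 means 6 bytes
$byChars = mb_substr($str, 0, 6, "UTF-8"); // length 6 means 6 characters

echo $byteLength, " bytes, ", $charLength, " characters\n";
echo $byBytes, "\n";
echo $byChars, "\n";
```

With the same length argument, mb_strcut returns two characters and mb_substr returns six, which is the mismatch to watch for when switching between the two functions.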


Do special characters cause errors?

Special characters here means emoji, special symbols, and combined characters (such as letters with diacritical marks). In UTF-8 a single code point occupies at most 4 bytes, and most emoji take exactly 4. A visible character built from several code points, such as a letter followed by a combining mark or several emoji joined together, takes more bytes in total.
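To make these sizes concrete, the sketch below measures a few sample characters in bytes and in code points. The samples are chosen for illustration and are written with `\u{...}` escapes, which assumes PHP 7.0 or later.

```php
<?php
// Illustrative samples, written as escapes so the file's own encoding does not matter.
$samples = [
    "euro sign"             => "\u{20AC}",
    "emoji"                 => "\u{1F600}",
    "e + combining acute"   => "e\u{0301}",
    "family (ZWJ sequence)" => "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}",
];

$sizes = [];
foreach ($samples as $label => $text) {
    $sizes[$label] = [
        "bytes"       => strlen($text),             // raw byte count
        "code_points" => mb_strlen($text, "UTF-8"), // what mbstring counts as characters
    ];
    printf("%-22s %2d bytes, %d code point(s)\n",
        $label, $sizes[$label]["bytes"], $sizes[$label]["code_points"]);
}
```

Note that mbstring counts code points, so an accented letter written as two code points, or a joined emoji, is more than one "character" to both mb_strcut and mb_substr.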

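1. Truncation problems

Because mb_strcut counts in bytes, the requested length can end in the middle of a multi-byte character. According to the PHP manual, mb_strcut then moves the cut back to the start of that character, so the character is dropped whole and the result can be shorter than requested. Garbled or incomplete output appears when a character really is split: with the byte-oriented substr, when the encoding argument does not match the data, or when a combined character made of several code points is cut between its parts.

Example:

 <?php
$str = "Hello 😀 World";
echo mb_strcut($str, 0, 8, "UTF-8"); // Output: "Hello " (the emoji does not fit and is dropped whole)
?>

The emoji here occupies 4 bytes (byte offsets 6 to 9). The requested length of 8 ends in the middle of it, so mb_strcut leaves the emoji out entirely. A plain substr call with the same arguments would keep two of its four bytes and produce garbled output.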
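The contrast with the byte-oriented substr can be checked directly. This is a minimal sketch, assuming the emoji U+1F600 as the 4-byte character and using mb_check_encoding to test whether each result is still valid UTF-8.

```php
<?php
$str = "Hello \u{1F600} World"; // U+1F600 occupies byte offsets 6..9

$raw  = substr($str, 0, 8);             // byte-oriented: keeps 2 of the emoji's 4 bytes
$safe = mb_strcut($str, 0, 8, "UTF-8"); // backs up to the previous character boundary

var_dump(mb_check_encoding($raw, "UTF-8"));  // is $raw valid UTF-8?
var_dump(mb_check_encoding($safe, "UTF-8")); // is $safe valid UTF-8?
var_dump(strlen($safe));                     // bytes actually returned by mb_strcut
```

The result of mb_strcut may therefore be shorter than the length you asked for, which matters if later code assumes an exact byte count.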

2. Support for 4-byte characters such as emoji

Recent versions of PHP's mbstring extension handle 4-byte characters such as emoji better, but you still need to pay attention to the cut length and to character boundaries.


FAQ Summary

| Problem | Description | Solution |
|---|---|---|
| Truncated characters and garbled output | The byte length ends inside a multi-byte character or a combined character, so the result is incomplete | Use mb_substr instead and cut by characters |
| 4-byte characters handled incorrectly | A 4-byte emoji is incomplete after cutting | Upgrade PHP and use an mbstring version that supports 4-byte characters |
| Bytes confused with characters | mb_strcut cuts by bytes and mb_substr cuts by characters, so mixing them up is an easy mistake | Clarify the requirement and choose the matching function |
| Inconsistent character encoding | The encoding passed in does not match the actual encoding of the string, so the cut goes wrong | Confirm the string's encoding and pass it in correctly |
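For the encoding row, one possible check is sketched below. It assumes the caller knows the real source encoding (ISO-8859-1 is used purely for illustration) and relies only on the standard mb_check_encoding and mb_convert_encoding functions.

```php
<?php
// "café" encoded as ISO-8859-1: the é is the single byte 0xE9 (illustrative input).
$input = "caf\xE9";

// Assumption for this sketch: the real source encoding is known to the caller.
$sourceEncoding = "ISO-8859-1";

if (!mb_check_encoding($input, "UTF-8")) {
    // Not valid UTF-8, so convert before doing any byte-based cutting.
    $input = mb_convert_encoding($input, "UTF-8", $sourceEncoding);
}

// After conversion é occupies byte offsets 3..4, so it does not fit in a 4-byte cut.
$cut = mb_strcut($input, 0, 4, "UTF-8");
echo $cut, "\n";
```

Validating once at the point where the string enters the program keeps every later mb_strcut or mb_substr call working on data that matches its encoding argument.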

Solution Examples

Use mb_substr to avoid garbled output

mb_substr cuts by characters, so it never splits a multi-byte character in half and avoids garbled output.

 <?php
$str = "Hello 😀 World";
echo mb_substr($str, 0, 7, "UTF-8"); // Output: "Hello 😀"
?>

Check the boundary when using mb_strcut

If you have to use mb_strcut, check whether the cut point is a complete character boundary, or use mb_strlen to get the number of characters and work out the corresponding number of bytes from that.
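One way to check boundaries by hand is sketched below. The helper name cutToByteBudget is hypothetical and not part of mbstring. It assumes UTF-8 input and uses the PCRE `\X` escape to walk the string one combined character at a time, so a letter and its combining mark stay together.

```php
<?php
/**
 * Hypothetical helper (not part of mbstring): returns the longest prefix of $text
 * that fits in $maxBytes without splitting a combined character.
 */
function cutToByteBudget(string $text, int $maxBytes): string
{
    // \X matches one extended grapheme cluster; the u modifier treats $text as UTF-8.
    if (preg_match_all('/\X/u', $text, $matches) === false) {
        return ''; // invalid UTF-8 or a PCRE error
    }

    $result = '';
    foreach ($matches[0] as $cluster) {
        if (strlen($result) + strlen($cluster) > $maxBytes) {
            break; // the next cluster would exceed the byte budget
        }
        $result .= $cluster;
    }
    return $result;
}

$text = "cafe\u{0301}!"; // "café!" written as e + U+0301 (combining acute accent)

echo mb_strcut($text, 0, 5, "UTF-8"), "\n"; // cuts on a code-point boundary
echo cutToByteBudget($text, 5), "\n";       // cuts on a combined-character boundary
```

Here mb_strcut keeps the bare "e" and drops only the accent, while the helper drops the accented letter as a unit. How `\X` segments newer emoji sequences depends on the PCRE library version, so treat that part as an assumption to verify on your own installation.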


Conclusion

mb_strcut is a useful tool for multi-byte strings, but because it cuts by bytes, it can return truncated or unexpected results when it meets special characters (especially 4-byte emoji). Understanding the difference between bytes and characters, choosing sensibly between mb_strcut and mb_substr, and keeping the character encoding consistent are the keys to avoiding problems.