How to avoid common problems and errors when cutting Chinese strings with mb_strcut

gitbox 2025-05-27

mb_strcut is a practical PHP function for working with multibyte text such as Chinese. It cuts a string by byte count while keeping multibyte characters whole, which avoids the garbled output caused by splitting a character. Even so, many developers run into the same pitfalls when using mb_strcut. This article explains how to use the function correctly and how to resolve the most common problems.

1. Understand the difference between mb_strcut and mb_substr

Before going further, let's clear up a common misunderstanding: mb_strcut and mb_substr look similar, but they behave very differently.

mb_substr cuts by characters: its start and length arguments count characters.
mb_strcut cuts by bytes: its start and length arguments count bytes, and the cut points are moved back to character boundaries so that no character is split.

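This matters for Chinese text, where each character usually takes 3 bytes in UTF-8. If the byte offset or length you pass does not line up with a character boundary, mb_strcut shifts the cut back to the start of that character, so the result can be shorter than requested or begin earlier than expected. A plain byte-based function such as substr would split the character instead and produce garbled output.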
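The difference is easiest to see side by side. The following is a minimal sketch, assuming the mbstring extension is loaded and the source file is saved as UTF-8; the sample string and variable names are only illustrative.

```php
<?php
$str = "中文字符串"; // 5 characters, 15 bytes in UTF-8

// Character-based: the first 2 characters.
$byChars = mb_substr($str, 0, 2, "UTF-8");

// Byte-based but character-aware: at most 4 bytes, cut back to a boundary.
$byBytes = mb_strcut($str, 0, 4, "UTF-8");

// Byte-based and not character-aware: exactly 4 bytes, splitting the 2nd character.
$raw = substr($str, 0, 4);

echo $byChars, "\n";                        // 中文
echo $byBytes, "\n";                        // 中
var_dump(mb_check_encoding($raw, "UTF-8")); // bool(false)
```

Only the substr result is invalid UTF-8; mb_strcut returns fewer bytes than requested rather than splitting a character.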

2. The correct way to use mb_strcut

Suppose we need to cut a Chinese string by bytes without corrupting any character:

 <?php
$str = "Welcome to visitgitbox.net，This is a Chinese string for demonstration。";
$cutStr = mb_strcut($str, 0, 18, "UTF-8");
echo $cutStr;
?>

The code above takes the first 18 bytes. Note two things:

With Chinese text (3 bytes per character in UTF-8), byte 18 can fall in the middle of a character. In that case mb_strcut drops the partial character, so the result is shorter than 18 bytes.
mb_strcut can only find character boundaries if it knows the encoding, so its behavior depends on the encoding you pass.

Therefore, always pass the correct encoding as the fourth parameter of mb_strcut, usually "UTF-8". If it is omitted, the internal encoding (see mb_internal_encoding) is used.

3. Frequently Asked Questions and Solutions

1. Garbled output

This is the most common problem. The usual causes are:

The wrong encoding is passed, or no encoding is set.
The start position or length falls inside a character, which splits it when the encoding is wrong or when a plain byte function such as substr is used.

Solution:

Always use UTF-8 encoding, and make sure that the output environment (such as the HTML page) is UTF-8 as well.

 header("Content-Type: text/html; charset=utf-8");
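Setting the response header only helps if the string itself is valid UTF-8. As a sketch of one way to check this before cutting, the example below validates the input with mb_check_encoding and converts it when it is not UTF-8. The GB18030 source encoding is an assumption chosen for illustration; in practice you need to know which encoding your data actually uses.

```php
<?php
// Simulate legacy input: the same text encoded as GB18030 instead of UTF-8.
// (GB18030 is an assumed source encoding for this demonstration.)
$legacy = mb_convert_encoding("中文字符串", "GB18030", "UTF-8");

// Cutting non-UTF-8 bytes as if they were UTF-8 is a typical cause of garbled output,
// so validate first and convert when the check fails.
if (mb_check_encoding($legacy, "UTF-8")) {
    $utf8 = $legacy;
} else {
    $utf8 = mb_convert_encoding($legacy, "UTF-8", "GB18030");
}

echo mb_strcut($utf8, 0, 6, "UTF-8"), "\n"; // 中文
```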

2. The resulting length does not match expectations

For example, if you want to display 10 characters rather than 10 bytes, mb_strcut is the wrong tool because it counts bytes. Use mb_substr instead:

 $cutStr = mb_substr($str, 0, 10, "UTF-8");
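For display purposes it is common to append a marker when the text was shortened. The following is a minimal sketch under that assumption; truncateChars and its $suffix parameter are hypothetical names introduced here, not part of mbstring.

```php
<?php
// Hypothetical helper: keep at most $maxChars characters and
// append $suffix only when the text was actually shortened.
function truncateChars(string $text, int $maxChars, string $suffix = "…"): string
{
    if (mb_strlen($text, "UTF-8") <= $maxChars) {
        return $text;
    }
    return mb_substr($text, 0, $maxChars, "UTF-8") . $suffix;
}

$str = "这是一个用于演示的中文字符串"; // 14 characters, 42 bytes in UTF-8

echo truncateChars($str, 10), "\n";                  // 这是一个用于演示的中…
echo strlen(mb_substr($str, 0, 10, "UTF-8")), "\n"; // 30 (bytes, for 10 characters)
```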

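3. Character loss or truncation errors

When you start cutting at a byte offset in the middle of the string (for example, from the 5th byte), the offset may fall inside a character. mb_strcut then moves the start back to the first byte of that character, so the result may begin earlier than you expected; with the wrong encoding, the output can be garbled.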

Suggestions:

Cut at character boundaries wherever possible (rather than at arbitrary byte offsets).
If you must work with byte offsets, test mb_strcut with representative input first to confirm the output.
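The start-offset behavior can be checked with a small experiment. This is a sketch with an illustrative sample string; the variable names are assumptions.

```php
<?php
$str = "中文字符串"; // each character occupies 3 bytes: offsets 0, 3, 6, 9, 12

// Offset 4 is the second byte of "文" (bytes 3-5).
// mb_strcut moves the start back to offset 3 instead of splitting the character.
$fromMiddle = mb_strcut($str, 4, 6, "UTF-8");
echo $fromMiddle, "\n"; // 文字

// Starting exactly on the boundary gives the same result here.
$fromBoundary = mb_strcut($str, 3, 6, "UTF-8");
var_dump($fromMiddle === $fromBoundary); // bool(true)
```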

4. A suggested wrapper function

To avoid repeating these mistakes, you can wrap the logic in a function that safely cuts Chinese strings:

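 function safeCutStr(string $string, int $length, string $charset = "UTF-8"): string {
    // Return at most $length bytes from the start of $string without splitting a character.
    return mb_strcut($string, 0, $length, $charset);
}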

Before output, you can also add a post-processing step that checks whether the last character is complete and drops it if it is not. This is mainly useful when the string may have been cut by code that is not multibyte-aware.
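One possible shape for such a post-processing step is sketched below. dropIncompleteTail is a hypothetical helper, not a built-in function, and it assumes the only damage is an incomplete character at the very end of the string.

```php
<?php
// Hypothetical helper (not a built-in): remove a trailing incomplete character.
// A UTF-8 character is at most 4 bytes, so at most 3 trailing bytes can be stray.
function dropIncompleteTail(string $text, string $charset = "UTF-8"): string
{
    for ($i = 0; $i < 3 && $text !== "" && !mb_check_encoding($text, $charset); $i++) {
        $text = substr($text, 0, -1); // drop one byte and re-check
    }
    return $text;
}

// substr() is not multibyte-aware: 5 bytes is "中" plus two bytes of "文".
$broken = substr("中文字符串", 0, 5);
echo dropIncompleteTail($broken), "\n"; // 中
```

If the string contains invalid bytes elsewhere, this helper gives up after three bytes and returns a string that is still invalid, so it does not replace validating the input encoding.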

5. Summary

When dealing with multibyte text such as Chinese, mb_strcut is a convenient way to cut by bytes, but you need to keep the relationship between bytes and characters in mind. To avoid garbled output as much as possible:

Always specify the correct encoding (such as UTF-8);
Prefer mb_substr when you need to cut by characters;
If you must cut by bytes, wrap the call in a function that includes fault-tolerance logic.

Used sensibly, mb_strcut makes your PHP code more robust and stable when processing Chinese text.