mb_strcut tips: avoiding encoding errors when truncating strings

gitbox 2025-05-26

When a multibyte string in PHP has to be cut to a byte limit, the usual tool is mb_strcut , which extracts a substring by byte count and is meant for multibyte encodings such as UTF-8. Even so, developers often end up with garbled output or half a character at the cut point, typically because the encoding is not what the function assumes or because the input was already damaged. This article explains how to use mb_strcut correctly to avoid these encoding errors and shares some practical tips.


What is mb_strcut?

mb_strcut is one of PHP's multibyte string functions. It extracts a substring by byte offset and byte length. Unlike mb_substr , which counts characters, mb_strcut counts bytes. Unlike substr , it does not split a multibyte character: if a cut position falls between two bytes of a character, the cut is moved back to the first byte of that character, so the result stays valid in the given encoding.
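To make the difference concrete, here is a small sketch that runs the three functions on the same input. The sample string and variable names are illustrative assumptions, and the file is assumed to be saved as UTF-8 with the mbstring extension enabled.

```php
<?php
// Illustrative sample (not from the article): four Chinese characters,
// each 3 bytes in UTF-8, so 12 bytes in total.
$text = "你好世界";

$rawBytes  = substr($text, 0, 4);              // plain byte cut: ends inside the 2nd character
$fourChars = mb_substr($text, 0, 4, 'UTF-8');  // counts characters: all 4 characters
$fourBytes = mb_strcut($text, 0, 4, 'UTF-8');  // counts bytes, keeps whole characters

var_dump(mb_check_encoding($rawBytes, 'UTF-8'));  // bool(false)
var_dump(strlen($fourChars));                      // int(12)
var_dump($fourBytes);                              // string(3) "你"
```

Only substr produces an invalid byte sequence here; mb_strcut returns fewer bytes than requested rather than splitting a character.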

Function prototype:

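 mb_strcut(string $string, int $start, ?int $length = null, ?string $encoding = null): string
  • $string : The input string.

  • $start : Start position in bytes, counted from 0. A negative value counts back from the end of the string.

  • $length : Maximum number of bytes to extract (optional). If omitted or null, everything up to the end of the string is returned.

  • $encoding : The character encoding of the string. If omitted or null, the internal encoding is used.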
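The optional parameters can be sketched as follows. The sample string is an assumption for illustration, and passing null for $length explicitly assumes PHP 8.

```php
<?php
$text = "你好世界";  // illustrative sample, assumed UTF-8: 12 bytes

// $length is null: everything from byte 6 to the end of the string.
$tail = mb_strcut($text, 6, null, 'UTF-8');

// Negative $start: begin 3 bytes before the end of the string.
$last = mb_strcut($text, -3, null, 'UTF-8');

var_dump($tail);  // string(6) "世界"
var_dump($last);  // string(3) "界"
```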


Why do I get encoding errors?

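When the encoding given to mb_strcut matches the string's real encoding, a $start or $length that falls in the middle of a multibyte character is not a problem in itself: the function moves the cut back to a character boundary. In UTF-8 a Chinese character usually takes 3 bytes, so most byte offsets in Chinese text fall inside a character and this adjustment happens constantly. Garbled output appears when the adjustment cannot work as intended: the $encoding argument is omitted and the internal encoding differs from the string's encoding, the wrong encoding is passed, the input already contains incomplete byte sequences, or a plain byte function such as substr is used instead.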
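An encoding mismatch is easy to reproduce. The sketch below cuts the same UTF-8 string twice, once with the right encoding and once with a single-byte encoding chosen purely for illustration.

```php
<?php
$text = "你好世界";  // illustrative sample, assumed to be UTF-8 encoded

// Correct encoding: the cut backs up to the end of the first character.
$ok = mb_strcut($text, 0, 4, 'UTF-8');

// Wrong encoding: ISO-8859-1 is single-byte, so every byte counts as a
// character and the cut lands inside the second UTF-8 character.
$broken = mb_strcut($text, 0, 4, 'ISO-8859-1');

var_dump(mb_check_encoding($ok, 'UTF-8'));      // bool(true)
var_dump(mb_check_encoding($broken, 'UTF-8'));  // bool(false)
```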


Practical tips to avoid encoding errors

1. Set the encoding explicitly

Pass the string's encoding explicitly whenever you call mb_strcut . This is the first step, because it removes any dependence on a default internal encoding that may differ between environments.

 $encoding = 'UTF-8';
$result = mb_strcut($str, $start, $length, $encoding);
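One way to apply this tip is to wrap the call so the encoding is always fixed and the input is validated first. The helper name cut_utf8_bytes below is hypothetical, and the sketch assumes the application works in UTF-8 throughout.

```php
<?php
// Hypothetical helper: validate first, then cut with an explicit encoding.
function cut_utf8_bytes(string $text, int $maxBytes): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    // The encoding is passed explicitly, so the internal encoding setting
    // has no influence on this call.
    return mb_strcut($text, 0, $maxBytes, 'UTF-8');
}

echo cut_utf8_bytes("你好世界", 7), PHP_EOL;  // 你好
```

Rejecting invalid input up front is a design choice; an application could equally log and repair it, which this sketch does not attempt.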

2. Use mb_strlen and mb_substr when the limit is in characters

If the limit is a number of characters rather than bytes, use mb_strlen to compare the string's character count with the limit and mb_substr to cut. Both functions count whole characters, so half a character is never cut off.

 $length = 10;
if (mb_strlen($str, $encoding) > $length) {
    $result = mb_substr($str, 0, $length, $encoding);
} else {
    $result = $str;
}
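The gap between character counts and byte counts is what makes the choice of function matter. A short sketch, using an illustrative sample string assumed to be UTF-8:

```php
<?php
$text = "PHP 字符串";  // 4 ASCII bytes plus 3 characters of 3 bytes each

var_dump(strlen($text));              // int(13): bytes
var_dump(mb_strlen($text, 'UTF-8'));  // int(7): characters

// Character limit: keep at most 5 characters.
$short = mb_substr($text, 0, 5, 'UTF-8');
var_dump($short);                     // string(7) "PHP 字"
```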

3. Combine mb_strcut and mb_check_encoding when the limit is in bytes

If you have to cut by number of bytes, cut with mb_strcut and then confirm with mb_check_encoding that the result is valid in the target encoding. If it is not, reduce the length until the result contains only complete characters.

 function safe_mb_strcut(string $str, int $start, int $length, string $encoding = 'UTF-8'): string {
    $substr = mb_strcut($str, $start, $length, $encoding);
    // With a matching encoding, mb_strcut already returns whole characters.
    // The check below guards against a mismatched encoding or invalid input.
    // If the result is not valid, shrink the length one byte at a time until it is.
    while ($length > 0 && !mb_check_encoding($substr, $encoding)) {
        $length--;
        $substr = mb_strcut($str, $start, $length, $encoding);
    }
    return $substr;
}

4. Example: cutting a UTF-8 multibyte string

 $str = "String: 中文字符串 and English";
$start = 0;
$length = 15;  // Limit in bytes; byte 15 falls inside the character 字

$result = safe_mb_strcut($str, $start, $length, 'UTF-8');
echo $result;  // String: 中文 (14 bytes)

The result contains only whole characters, so cutting by bytes does not produce garbled output.
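A quick way to gain confidence in this behaviour is to try every possible byte limit on a mixed string. The sketch below uses an illustrative sample, assumed to be UTF-8, and calls mb_strcut directly.

```php
<?php
$text = "String: 中文字符串 and English";  // illustrative sample, assumed UTF-8
$total = strlen($text);
$allValid = true;

// Try every possible byte limit and confirm two properties of the result:
// it is valid UTF-8, and it never exceeds the requested number of bytes.
for ($limit = 0; $limit <= $total; $limit++) {
    $piece = mb_strcut($text, 0, $limit, 'UTF-8');
    if (!mb_check_encoding($piece, 'UTF-8') || strlen($piece) > $limit) {
        $allValid = false;
    }
}

var_dump($allValid);  // bool(true)
```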


Summary

  • mb_strcut cuts multibyte strings by bytes and, given the correct encoding, moves cut points back to character boundaries so that no character is split.

  • Pass the encoding explicitly so that the function behaves consistently.

  • Use mb_check_encoding to verify the encoding of the input and of the result.

  • When the limit is a number of characters, mb_strlen and mb_substr are the safer choice.

With these techniques, encoding errors when truncating multibyte strings in PHP can be avoided, which keeps text processing accurate and the user experience intact.


To learn more about PHP string processing, see the following resource:

https://gitbox.net/php/manual/zh/function.mb-strcut.php