Byte and character issues in mb_strcut, the difference you have to know

gitbox 2025-05-26

In PHP development, handling multibyte strings is a common and error-prone task. When extracting part of a string, mb_strcut is often used instead of substr so that multibyte characters are not split and the output does not come out garbled. However, many developers are unsure how bytes and characters differ in mb_strcut. This article explains the difference in detail to help you understand and use the function correctly.

1. Introduction to mb_strcut function

mb_strcut is a function in PHP's multibyte string extension, mbstring, and is used to extract part of a string.

 string mb_strcut ( string $str , int $start [, int $length = NULL [, string $encoding = mb_internal_encoding() ]] )
  • $str : the input string

  • $start : the starting position, in bytes

  • $length : the length to extract, also in bytes (optional)

  • $encoding : string encoding, default to internal encoding

2. The difference between bytes and characters

  • Byte : The basic unit of data storage in a computer, 1 byte = 8 bits. A byte can represent an English character, but for Chinese characters or other multi-byte characters, multiple bytes are often required.

  • Character : refers to a complete "symbol", regardless of how many bytes it occupies.

For example, in UTF-8 encoding, a Chinese character usually occupies 3 bytes, while an English letter occupies 1 byte.
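The byte/character distinction can be checked directly by comparing strlen, which counts bytes, with mb_strlen, which counts characters. The following is a minimal sketch; the sample string is chosen only for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Illustrative string (an assumption): two Chinese characters followed by three ASCII letters.
$str = "你好abc";

$byteCount = strlen($str);             // counts bytes: 3 + 3 + 1 + 1 + 1
$charCount = mb_strlen($str, "UTF-8"); // counts characters

echo "bytes: {$byteCount}, characters: {$charCount}\n"; // bytes: 9, characters: 5
```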

3. mb_strcut cuts by bytes

The key point of mb_strcut is that its $start and $length parameters are both measured in bytes. This sets it apart from functions such as mb_substr, whose parameters are measured in characters.

This means that if you want to extract 5 characters starting from the third character, mb_strcut requires you to work out the byte offset and byte length of those characters first. Passing character indices directly will not split a character, because mb_strcut snaps to character boundaries, but it will return the wrong part of the string.
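One way to do that conversion is to measure the relevant substrings with mb_substr and strlen. This is only a sketch under stated assumptions: the sample string, the variable names, and the UTF-8 source file encoding are all chosen for illustration.

```php
<?php
// Illustrative string: full-width comma and exclamation mark, 3 bytes each in UTF-8.
$str = "你好，world！";
$encoding = "UTF-8";

$charStart  = 2; // third character (zero-based index)
$charLength = 5; // number of characters wanted

// Convert character positions to byte positions by measuring substrings.
$byteStart  = strlen(mb_substr($str, 0, $charStart, $encoding));
$byteLength = strlen(mb_substr($str, $charStart, $charLength, $encoding));

$viaStrcut = mb_strcut($str, $byteStart, $byteLength, $encoding);
$viaSubstr = mb_substr($str, $charStart, $charLength, $encoding);

echo "byte start: {$byteStart}, byte length: {$byteLength}\n"; // byte start: 6, byte length: 7
echo $viaStrcut, "\n"; // ，worl
```

In practice, when the goal is expressed in characters, calling mb_substr directly is simpler. The conversion is shown only to make the unit difference visible.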

4. Why use mb_strcut?

The advantage of mb_strcut is that it never cuts a multibyte character in the middle. If $start falls inside a character, extraction begins at the first byte of that character; if the requested length ends inside a character, that character is left out, so the result may be shorter than $length bytes but is never garbled.

For example:

 <?php
$str = "你好，world！"; // "你好" is two Chinese characters, followed by English letters and an exclamation mark
echo mb_strcut($str, 0, 6, "UTF-8"); // Output: 你好
?>

In the above code, 6 bytes is exactly the length of the two Chinese characters "你" and "好" (3 bytes each), so mb_strcut returns both of them intact.

By contrast, substr is also byte-based but knows nothing about character boundaries: if its offsets fall inside a multibyte character, the character is split and the output is garbled.
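To make the contrast between substr and mb_strcut concrete, the sketch below cuts at a byte position that lands inside a character. The string and variable names are illustrative assumptions, and the source file is assumed to be saved as UTF-8.

```php
<?php
$str = "你好"; // two 3-byte characters in UTF-8

$raw  = substr($str, 0, 4);             // 4 raw bytes: "你" plus the first byte of "好"
$safe = mb_strcut($str, 0, 4, "UTF-8"); // backs off to the previous character boundary
$mid  = mb_strcut($str, 1, 3, "UTF-8"); // a start inside "你" is moved back to byte 0

var_dump(mb_check_encoding($raw, "UTF-8"));  // bool(false): the result is not valid UTF-8
var_dump(mb_check_encoding($safe, "UTF-8")); // bool(true)
echo $safe, " (", strlen($safe), " bytes)\n"; // 你 (3 bytes)
echo $mid, "\n";                              // 你
```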

5. Example: calculating byte lengths

Knowing that mb_strcut works in bytes, we can use mb_strlen, mb_substr, and strlen to see how many bytes each character occupies. For example:

 <?php
$str = "你好，world！"; // full-width comma and exclamation mark
$encoding = "UTF-8";
$charCount = mb_strlen($str, $encoding);
for ($i = 0; $i < $charCount; $i++) {
    $char = mb_substr($str, $i, 1, $encoding);
    // strlen() counts raw bytes, so this is the character's size in $encoding
    $byteLen = strlen($char);
    echo "Character {$char} occupies {$byteLen} byte(s)\n";
}
?>

Output:

 Character 你 occupies 3 byte(s)
Character 好 occupies 3 byte(s)
Character ， occupies 3 byte(s)
Character w occupies 1 byte(s)
Character o occupies 1 byte(s)
Character r occupies 1 byte(s)
Character l occupies 1 byte(s)
Character d occupies 1 byte(s)
Character ！ occupies 3 byte(s)

This shows that in UTF-8 the Chinese characters and the full-width punctuation marks each occupy 3 bytes, while the ASCII letters occupy 1 byte each.

6. Choose mb_strcut or mb_substr?

  • If you want to truncate strings based on byte length and prevent multi-byte characters from being truncated in the middle, you should use mb_strcut .

  • If you want to extract part of a string based on the number of characters (regardless of how many bytes each character takes), you should use mb_substr .
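The difference between the two choices shows up clearly when both functions receive the same numeric arguments. The following sketch uses an illustrative string and assumes the source file is saved as UTF-8.

```php
<?php
$str = "你好，world！"; // illustrative string with full-width punctuation

$byBytes = mb_strcut($str, 0, 6, "UTF-8"); // at most 6 bytes
$byChars = mb_substr($str, 0, 6, "UTF-8"); // exactly 6 characters

echo $byBytes, "\n"; // 你好
echo $byChars, "\n"; // 你好，wor
```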

7. Things to note

  • Be sure to specify the correct encoding; otherwise mb_strcut may locate character boundaries incorrectly and cut in the wrong place.

  • In network transmission, database storage, or file operations, a string's byte length often matters more than its character count, and this is where mb_strcut is most useful.

  • If you are unclear about the difference between bytes and characters, you are likely to run into unexpected truncation and garbled output.


 <?php
// Sample code: use mb_strcut to take the first 6 bytes of a UTF-8 string (two Chinese characters)
$str = "你好，world！";
$cutStr = mb_strcut($str, 0, 6, "UTF-8");
echo $cutStr; // Output: 你好
?>
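Returning to the note on storage and transmission limits, a small wrapper can keep a string within a byte budget. This is a sketch only: truncateToBytes is a hypothetical helper name, the 8-byte limit is arbitrary, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: trim a string so that it fits in $maxBytes bytes
// without splitting a multibyte character.
function truncateToBytes(string $text, int $maxBytes, string $encoding = "UTF-8"): string
{
    if (strlen($text) <= $maxBytes) {
        return $text; // already fits, nothing to cut
    }
    return mb_strcut($text, 0, $maxBytes, $encoding);
}

$title  = "你好，world！";              // 17 bytes in UTF-8
$stored = truncateToBytes($title, 8);  // byte 8 falls inside "，", so that character is dropped

echo $stored, " (", strlen($stored), " bytes)\n"; // 你好 (6 bytes)
```

The result can be shorter than the limit, which is the price of never storing a partial character.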