How to accurately detect multibyte string length using mb_get_info and mb_strlen?

gitbox 2025-05-11

When processing multibyte strings (such as Chinese, Japanese, or Korean text) in PHP, standard string functions such as strlen often give unexpected results, because they count bytes rather than characters. To get accurate results, use the functions in PHP's Multibyte String extension (mbstring), such as mb_strlen and mb_get_info.

This article walks through the basic usage of mb_strlen and mb_get_info, with examples showing how they help you accurately measure the length of multibyte strings.

1. Why can’t strlen be used directly?

Let’s take a look at a simple example:

 $str = "你好，世界";
echo strlen($str);  // Output: 15

This string has only 5 characters (including the full-width comma), but strlen returns 15. Under UTF-8, each of these characters takes 3 bytes, and strlen counts bytes, not characters.

If we want to get the true number of characters, we should use mb_strlen :

 echo mb_strlen($str);  // Output: 5

This way we get the correct number of characters.
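
To make the gap between bytes and characters concrete, here is a minimal sketch that reports both counts for a few sample strings. The helper name describeLength is hypothetical, the samples are chosen only for illustration, and the script assumes the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper: returns both the byte count and the character count.
function describeLength(string $s): array
{
    return [
        'bytes' => strlen($s),             // raw byte count
        'chars' => mb_strlen($s, 'UTF-8'), // count of UTF-8 characters
    ];
}

// "\u{00E9}" is the precomposed character é (2 bytes in UTF-8).
foreach (['abc', '你好', "\u{00E9}"] as $sample) {
    $len = describeLength($sample);
    echo "{$sample}: {$len['bytes']} bytes, {$len['chars']} chars\n";
}
// abc: 3 bytes, 3 chars
// 你好: 6 bytes, 2 chars
// é: 2 bytes, 1 chars
```

ASCII characters take 1 byte each, so the two counts only diverge once the string contains multibyte characters.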

2. Use mb_strlen to accurately calculate character length

mb_strlen is a function designed specifically for multibyte characters, with the syntax as follows:

 mb_strlen(string $string, ?string $encoding = null): int

$string : the string whose length is measured
$encoding : optional; the character encoding to use. If omitted or null, the value returned by mb_internal_encoding() is used

Example:

 $str = "欢迎访问 https://gitbox.net";
$length = mb_strlen($str, 'UTF-8');
echo "The character length is: $length";

Output:

 The character length is: 23

This correctly counts the "number of characters" in a mixed Chinese and English string, not the number of bytes.
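
The encoding argument determines how the bytes are interpreted, so the same string can yield different lengths. The following sketch illustrates this with the '8bit' encoding, which treats every byte as one character. It assumes the source file is saved as UTF-8.

```php
<?php
$str = "你好"; // two characters, three bytes each in UTF-8

echo mb_strlen($str, 'UTF-8'), "\n"; // 2: bytes decoded as UTF-8 characters
echo mb_strlen($str, '8bit'), "\n";  // 6: every byte counted as one character
```

This is why a mismatch between the declared encoding and the actual encoding of the data produces lengths that look wrong.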

3. How to use mb_get_info to obtain encoding information?

mb_get_info returns the current mbstring configuration, including the internal encoding:

 $info = mb_get_info();
print_r($info);

Output example:

 Array
(
    [internal_encoding] => UTF-8
    [http_input] => pass
    [http_output] => pass
    [language] => neutral
    [encoding_translation] => 0
    ...
)

This shows that the internal encoding is currently UTF-8. If mb_strlen returns unexpected results, check whether the internal encoding matches the actual encoding of your strings.

You can also pass a key to retrieve a single setting:

 echo mb_get_info("internal_encoding");  // Output: UTF-8
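
One way to combine the two functions is to look up the internal encoding first and measure the string only if its bytes are valid in that encoding. This is a sketch under assumptions, not a canonical pattern: the helper name safeCharLength is hypothetical, and using mb_check_encoding as the validity test is a choice made for this example.

```php
<?php
// Hypothetical helper: returns the character count, or null when the bytes
// are not valid in the current internal encoding.
function safeCharLength(string $s): ?int
{
    $encoding = mb_get_info('internal_encoding'); // e.g. "UTF-8"
    if (!mb_check_encoding($s, $encoding)) {
        return null; // malformed input; a length would be misleading
    }
    return mb_strlen($s, $encoding);
}

var_dump(safeCharLength("你好"));     // a valid UTF-8 string
var_dump(safeCharLength("\xFF\xFE")); // bytes that are not valid UTF-8
```

Returning null for malformed input lets the caller decide how to handle it, rather than reporting a length based on a wrong interpretation of the bytes.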

4. Recommendation: set the default encoding

To avoid problems, it is recommended to set the default multibyte encoding at the beginning of the script:

 mb_internal_encoding("UTF-8");

This ensures that mb_strlen, mb_substr, and other mbstring functions treat strings as UTF-8 when no encoding argument is passed.
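
As a brief illustration, the sketch below sets the internal encoding once and then calls mbstring functions without an explicit encoding argument. The sample string is chosen for this example, and the source file is assumed to be saved as UTF-8.

```php
<?php
mb_internal_encoding('UTF-8'); // set once, near the top of the script

$str = "你好，世界";

echo mb_internal_encoding(), "\n"; // UTF-8
echo mb_strlen($str), "\n";        // 5
echo mb_substr($str, 0, 2), "\n";  // 你好
```

Passing the encoding explicitly on each call, as in mb_strlen($str, 'UTF-8'), is an alternative that does not depend on global state.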