When PHP strings contain multi-byte characters such as Chinese, Japanese, or Korean, conventional byte-oriented string functions (such as substr) can cut a character in the middle, producing garbled or incomplete output. To avoid this problem, PHP provides the multi-byte string extension mbstring. Two of its most practical functions are mb_get_info and mb_substr.
This article uses examples to show how to combine these two functions to truncate multi-byte strings safely and correctly.
mb_get_info() returns the current mbstring configuration. It tells us which internal encoding is in use, so we can make sure that string operations run with a consistent encoding.
<?php
$info = mb_get_info();
echo "The currently used multibyte encoding is:" . $info['internal_encoding'];
?>
Generally, it is recommended to set the encoding explicitly at the beginning of the script to avoid problems caused by default settings:
<?php
mb_internal_encoding('UTF-8'); // Set the internal encoding to UTF-8
?>
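To confirm that the setting took effect, the value can be read back. The sketch below assumes the mbstring extension is loaded; the variable names are only illustrative. It reads the encoding once through mb_internal_encoding() called without arguments and once through mb_get_info() with a single setting name.

```php
<?php
// Set the internal encoding, then read it back in two ways.
mb_internal_encoding('UTF-8');

// Called without an argument, mb_internal_encoding() returns the current value.
$viaGetter = mb_internal_encoding();

// Passing a setting name to mb_get_info() returns that one setting
// instead of the full configuration array.
$viaInfo = mb_get_info('internal_encoding');

echo $viaGetter, PHP_EOL; // UTF-8
echo $viaInfo, PHP_EOL;   // UTF-8
```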
mb_substr() is the multi-byte counterpart of substr(). It extracts a substring of a given length from a string, supports many character encodings, and counts in characters rather than bytes, so it does not split a character in the middle when the encoding is specified correctly.
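The difference between the two functions is easiest to see side by side. The following is a minimal sketch, assuming the source file is saved as UTF-8; the sample string and variable names are illustrative only.

```php
<?php
$text = "你好世界"; // 4 characters, 3 bytes each in UTF-8 (12 bytes total)

// substr() counts bytes: 4 bytes is "你" plus the first byte of "好".
$byBytes = substr($text, 0, 4);

// mb_substr() counts characters: 2 characters is "你好".
$byChars = mb_substr($text, 0, 2, 'UTF-8');

var_dump(mb_check_encoding($byBytes, 'UTF-8')); // bool(false): ends in a broken byte sequence
var_dump(mb_check_encoding($byChars, 'UTF-8')); // bool(true)
echo $byChars, PHP_EOL;                         // 你好
```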
The syntax is as follows:
mb_substr(string $string, int $start, ?int $length = null, ?string $encoding = null): string
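Parameters:
$string : the original string
$start : start position, counted in characters from 0
$length : optional; the maximum number of characters to extract (if omitted or null, everything up to the end of the string is returned)
$encoding : optional; the character encoding (passing it explicitly is recommended; if omitted, the internal encoding is used)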
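The parameters above can be tried out on a short string. This is a small sketch with an illustrative Japanese sample, assuming a UTF-8 source file. It also shows a negative $start, which counts back from the end of the string.

```php
<?php
$s = "日本語のテキスト"; // 8 characters

$first = mb_substr($s, 0, 3, 'UTF-8');    // start at 0, take 3 characters
$rest  = mb_substr($s, 3, null, 'UTF-8'); // null length: everything from position 3 to the end
$tail  = mb_substr($s, -4, 4, 'UTF-8');   // negative start: begin 4 characters before the end

echo $first, PHP_EOL; // 日本語
echo $rest, PHP_EOL;  // のテキスト
echo $tail, PHP_EOL;  // テキスト
```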
Suppose we want to take the first 50 characters of a UTF-8-encoded article as a summary:
<?php
mb_internal_encoding('UTF-8'); // Declare the internal encoding
$article = "PHP is a widely used open source general-purpose scripting language,especially suited to web development and embeddable in HTML。";
// Take the first 50 characters
$summary = mb_substr($article, 0, 50);
echo "Article summary:" . $summary;
?>
The output is not garbled, because mb_substr counts in characters rather than bytes.
As another example, when a user submits a comment, we may want to show only the first 30 characters in a list view and provide a "Read the full text" link:
<?php
mb_internal_encoding('UTF-8');
$comment = "This is a very exciting comment submitted by a user,and we want only part of it to be displayed。";
$preview = mb_substr($comment, 0, 30);
// Escape the user-supplied text before embedding it in HTML
echo htmlspecialchars($preview, ENT_QUOTES, 'UTF-8') . '... <a href="https://gitbox.net/full-comment.php?id=123">Read the full text</a>';
?>
This keeps overly long content from bloating the page and ensures that no character is cut in half.
If you are working with other encodings such as GBK or BIG5, remember to pass the encoding parameter explicitly to each mb_ function.
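As a rough illustration of working with a non-UTF-8 encoding, the sketch below builds a GBK string from a UTF-8 literal with mb_convert_encoding() and then passes 'GBK' to every mb_ call. It assumes the source file is saved as UTF-8 and that the installed mbstring build supports GBK. The sample text is illustrative.

```php
<?php
$utf8 = "多字节字符串"; // 6 characters in a UTF-8 source file

// Convert to GBK, where each of these characters occupies 2 bytes.
$gbk = mb_convert_encoding($utf8, 'GBK', 'UTF-8');

// Name the encoding explicitly in every call that touches the GBK data.
$len  = mb_strlen($gbk, 'GBK');       // 6 characters (12 bytes)
$head = mb_substr($gbk, 0, 3, 'GBK'); // first 3 characters, still GBK bytes

// Convert back to UTF-8 only for display.
echo mb_convert_encoding($head, 'UTF-8', 'GBK'), PHP_EOL; // 多字节
```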
mb_strlen() can be used alongside mb_substr() to check whether truncation is needed at all (for example, a string of only 20 characters does not need to be cut to 30).
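One possible way to package this length check is a small helper. The function name preview_text and its parameters are hypothetical, chosen only for this sketch. It returns short strings unchanged and appends an ellipsis only when something was actually cut.

```php
<?php
// Hypothetical helper: truncate only when the text exceeds $limit characters.
function preview_text(string $text, int $limit, string $encoding = 'UTF-8'): string
{
    if (mb_strlen($text, $encoding) <= $limit) {
        return $text; // short enough: return unchanged, no ellipsis
    }
    return mb_substr($text, 0, $limit, $encoding) . '...';
}

echo preview_text("短评论", 30), PHP_EOL;                  // 短评论
echo preview_text("这是一条比较长的用户评论", 5), PHP_EOL; // 这是一条比...
```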
When outputting truncated content in an HTML context, escape it to avoid XSS problems.
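A minimal sketch of the escaping step follows, using an illustrative comment that contains markup. The order is a design choice: the raw text is truncated first and escaped afterwards, because truncating already-escaped text could cut an entity such as &lt; in half.

```php
<?php
// Illustrative user input containing HTML markup.
$comment = '<b>很棒</b>的文章';

// 1. Truncate the raw text by characters.
$preview = mb_substr($comment, 0, 10, 'UTF-8'); // <b>很棒</b>的

// 2. Escape the truncated text for HTML output.
$safe = htmlspecialchars($preview, ENT_QUOTES, 'UTF-8');

echo $safe, PHP_EOL; // &lt;b&gt;很棒&lt;/b&gt;的
```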