Effective Methods and Tips for Extracting Chinese Strings in PHP

gitbox 2025-07-28

Common Methods for Extracting Chinese Strings in PHP

Truncating strings is a routine task in PHP, a widely used server-side language. In UTF-8, each Chinese character typically occupies three bytes, so byte-oriented functions such as substr() and strlen() can cut a character in half and produce garbled output. This article shares practical methods for extracting Chinese strings correctly.
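To make the byte-versus-character difference concrete, here is a minimal sketch. It assumes the source file is saved as UTF-8 and that the mbstring extension is enabled.

```php
<?php
// Sketch: byte-based vs. character-based functions on a UTF-8 string.
$str = "PHP中截取中文字符串";

echo strlen($str), "\n";             // 27 bytes: 3 ASCII bytes + 8 Chinese characters at 3 bytes each
echo mb_strlen($str, 'UTF-8'), "\n"; // 11 characters

// substr() cuts at a byte offset, so 4 bytes ends inside "中".
$broken = substr($str, 0, 4);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false): the result is not valid UTF-8

// mb_substr() cuts at a character boundary instead.
$ok = mb_substr($str, 0, 4, 'UTF-8');
echo $ok, "\n"; // PHP中
```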

Using the mb_substr() Function to Extract Chinese Strings

The mb_substr() function, provided by PHP's mbstring extension, is designed for multibyte characters and can accurately extract Chinese substrings. It counts characters rather than bytes, so it never splits a character.

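Function signature:

mb_substr(string $string, int $start, ?int $length = null, ?string $encoding = null): string

Parameter explanation:

$string: The string to extract from.
$start: Starting position as a character index, starting at 0. A negative value counts back from the end of the string.
$length: Maximum number of characters to extract. If omitted or null, everything up to the end of the string is returned.
$encoding: Character encoding. If omitted or null, the internal encoding set by mb_internal_encoding() is used, which is normally UTF-8.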

Example code:

$str = "PHP中截取中文字符串";
$substr = mb_substr($str, 0, 6);
echo $substr;  // Outputs: "PHP中截取" (6 characters: 3 ASCII letters + 3 Chinese characters)
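The start and length arguments also accept negative values and null. The following sketch illustrates this on the same sample string; the encoding is passed explicitly so the result does not depend on the configured internal encoding.

```php
<?php
$str = "PHP中截取中文字符串";

// Passing null as the length returns everything from the start position to the end.
echo mb_substr($str, 3, null, 'UTF-8'), "\n"; // 中截取中文字符串

// A negative start counts back from the end of the string.
echo mb_substr($str, -3, null, 'UTF-8'), "\n"; // 字符串

// A negative length stops that many characters before the end.
echo mb_substr($str, 3, -3, 'UTF-8'), "\n"; // 中截取中文
```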

Extracting Chinese Characters with Regular Expressions

Besides mb_substr(), regular expressions combined with preg_match() can extract Chinese characters. With the u modifier, the pattern is treated as UTF-8 and can match the Unicode range that covers common Chinese characters (U+4E00 to U+9FA5).

Example code:

$str = "PHP中截取中文字符串";
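// No ^ anchor: the string starts with "PHP", so an anchored pattern would not match.
preg_match('/[\x{4e00}-\x{9fa5}]+/u', $str, $matches);
echo $matches[0];  // Outputs: "中截取中文字符串" (the first run of Chinese characters)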
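When a string contains several separate runs of Chinese text, preg_match_all() can collect all of them. The sketch below swaps the explicit code-point range for the \p{Han} script property, which PCRE supports in UTF-8 mode; the mixed sample string is made up for illustration.

```php
<?php
// Illustrative input with two separate runs of Chinese characters.
$str = "PHP中截取abc中文字符串";

// \p{Han} matches any character in the Han script; the u modifier enables UTF-8 mode.
preg_match_all('/\p{Han}+/u', $str, $matches);

// $matches[0] holds every full match: "中截取" and "中文字符串".
print_r($matches[0]);
```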

Custom Function to Extract Chinese Strings

For more flexible logic, you can write a custom function that combines mb_strlen() and mb_substr() and appends an ellipsis only when the string is actually truncated.

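// Truncate $str to $length characters and append "..." if anything was cut off.
function substr_cn($str, $length) {
    $len = mb_strlen($str, 'UTF-8');  // length in characters, not bytes
    if ($len > $length) {
        $str = mb_substr($str, 0, $length, 'UTF-8') . '...';
    }
    return $str;
}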

$str = "PHP中截取中文字符串";
$substr = substr_cn($str, 6);
echo $substr;  // Outputs: "PHP中截取..."
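One possible variation, sketched here under assumptions, counts the marker toward the length limit and lets the caller choose the marker and the encoding. The name truncate_chars and its parameters are hypothetical and not part of PHP or of the function above.

```php
<?php
// Hypothetical helper: truncate so that content plus marker fit within $limit
// characters (assuming the marker itself is no longer than $limit).
function truncate_chars(string $str, int $limit, string $marker = '...', string $encoding = 'UTF-8'): string
{
    if (mb_strlen($str, $encoding) <= $limit) {
        return $str; // already short enough, return unchanged
    }
    // Reserve room for the marker; never keep fewer than zero characters.
    $keep = max(0, $limit - mb_strlen($marker, $encoding));
    return mb_substr($str, 0, $keep, $encoding) . $marker;
}

echo truncate_chars("PHP中截取中文字符串", 6), "\n";      // PHP...
echo truncate_chars("PHP中截取中文字符串", 8, '…'), "\n"; // PHP中截取中…
echo truncate_chars("中文", 6), "\n";                     // 中文
```

Counting the marker toward the limit keeps the result within a fixed display budget, which can matter for titles or list previews.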

Important Notes

When using these methods, please pay attention to:

Ensure the character encoding is correct. Call mb_internal_encoding('UTF-8'); once to set the default encoding, or pass the encoding explicitly to each mb_* function, to avoid extraction errors caused by an encoding mismatch.
Chinese characters are multibyte, so avoid byte-oriented functions such as strlen() and substr(). Use mb_strlen() to get the length in characters.
Extract substrings by character position, not byte offset, to avoid cutting a character in half and producing garbled text.
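The notes above can be sketched as follows. The GBK scenario is an assumption for illustration: it stands in for any input that does not arrive as UTF-8 and therefore has to be converted before it is sliced.

```php
<?php
// Set the default encoding once so mb_* calls can omit the encoding argument.
mb_internal_encoding('UTF-8');

$str = "中文字符串";
echo mb_strlen($str), "\n"; // 5

// Assumed scenario: the same text arrives encoded as GBK instead of UTF-8.
$gbk = mb_convert_encoding($str, 'GBK', 'UTF-8');
echo strlen($gbk), "\n"; // 10 bytes: GBK uses 2 bytes per Chinese character

// Convert to UTF-8 first, so every later call works in one encoding.
$utf8 = mb_convert_encoding($gbk, 'UTF-8', 'GBK');
echo mb_substr($utf8, 0, 2), "\n"; // 中文
```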

Summary

Extracting Chinese strings is a common and important task in PHP development. Using the built-in mb_substr() function, regular expressions, or custom functions, you can flexibly and efficiently extract Chinese substrings. Proper handling of character encoding and length is key to accurate results. We hope this article offers practical guidance for your projects.