Simple Method to Extract Only Chinese Characters Using PHP

gitbox 2025-08-02

What is Chinese?

Chinese is a language spoken in mainland China, Taiwan, Singapore, Malaysia, and other regions. It is written using Chinese characters, which are built from strokes and radicals. The grammar has little inflection and relies mainly on function words and word order to convey meaning.

Chinese Character Encoding

In computing, Chinese characters need to be encoded for storage and processing. The common encoding methods include:

GB2312 Encoding

Unicode Encoding

GB2312 is a double-byte encoding covering 6,763 Chinese characters, plus a set of other symbols. Unicode assigns a code point to characters from virtually all writing systems in use. How many bytes a Chinese character occupies depends on the encoding form: characters in the common range U+4E00-U+9FFF take two bytes in UTF-16 and three bytes in UTF-8.
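To make the byte counts concrete, here is a minimal sketch that converts one character between encodings and measures the result. It assumes the mbstring extension is loaded, and it uses EUC-CN, the byte encoding normally meant by "GB2312"; the variable names are illustrative only.

```php
<?php
// U+4E2D (中) written as an escape, so the result does not depend on
// how this source file itself is saved. "\u{...}" literals produce UTF-8.
$utf8  = "\u{4E2D}";
$utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
$gb    = mb_convert_encoding($utf8, 'EUC-CN', 'UTF-8'); // EUC-CN: assumed stand-in for GB2312

echo strlen($utf8), "\n";  // 3 bytes in UTF-8
echo strlen($utf16), "\n"; // 2 bytes in UTF-16
echo strlen($gb), "\n";    // 2 bytes in GB2312 (EUC-CN)
```

Because strlen() counts bytes rather than characters, the same character reports a different length in each encoding. This is why the regular expressions in this article rely on the u flag instead of byte-level matching.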

How to Extract Only Chinese Characters?

In PHP, you can use a regular expression with the u (UTF-8) modifier to target the Unicode range of Chinese characters and strip everything else from a string.

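// Remove non-Chinese characters from text
function remove_non_chinese($text) {
  // The leading ^ negates the class, so the pattern matches every run of
  // characters OUTSIDE U+4E00-U+9FA5; those runs are replaced with ''
  $pattern = '/[^\x{4e00}-\x{9fa5}]+/u';
  return preg_replace($pattern, '', $text);
}

The code above uses a negated character class, [^\x{4e00}-\x{9fa5}], to match everything that is not in the Unicode range U+4E00-U+9FA5 and replaces it with an empty string, so only Chinese characters remain. Without the ^, the pattern would match the Chinese characters themselves and delete them instead. The u flag makes PCRE treat the pattern and the subject as UTF-8, which is required for the \x{...} code point syntax.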
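As an alternative to a hard-coded code point range, PCRE also offers the Unicode script property \p{Han}. The following is a sketch rather than a drop-in replacement: the function name extract_han is hypothetical, and it collects the matching runs with preg_match_all instead of deleting everything else.

```php
<?php
// Hypothetical helper: returns only the Han-script characters of $text.
function extract_han(string $text): string {
    // \p{Han} matches any character assigned to the Han script;
    // the u flag enables UTF-8 mode so the property is available.
    if (preg_match_all('/\p{Han}+/u', $text, $matches) === false) {
        return ''; // regex error, e.g. invalid UTF-8 input
    }
    // $matches[0] holds each contiguous run of Han characters.
    return implode('', $matches[0]);
}

echo extract_han('Hello, 你好，我是一个 PHP 开发者。'); // 你好我是一个开发者
```

Note that \p{Han} is broader than the fixed range: it also covers the extension blocks and matches the same characters when they appear in Japanese text, so it may or may not be what a given application wants.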

Example

$text = 'Hello, 你好，我是一个 PHP 开发者。';

Using the function to remove non-Chinese characters:

$chinese_only = remove_non_chinese($text);
echo $chinese_only; // Output: 你好我是一个开发者

The output shows that English letters (including "PHP"), spaces, and punctuation are removed, leaving only Chinese characters.

Usage Notes

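The range U+4E00-U+9FA5 lies in the CJK Unified Ideographs block, which contains the simplified and traditional characters in common use. It does not cover rarer characters outside that range, such as those in the extension blocks (for example Extension A, U+3400-U+4DBF). Also, all punctuation will be removed, including full-width Chinese marks such as ， and 。. You may need to adjust the regular expression to suit your application.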
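As one example of such an adjustment, the sketch below keeps a handful of common Chinese punctuation marks alongside the characters themselves. The function name and the particular set of marks (、 。 ， ！ ？ ： ；) are assumptions chosen for illustration; extend or trim the list to fit your data.

```php
<?php
// Hypothetical variant: keep Chinese characters plus selected Chinese punctuation.
function keep_chinese_with_punctuation(string $text): string {
    // Negated class: anything NOT listed here is removed.
    //   \x{4e00}-\x{9fa5}   common Chinese characters
    //   \x{3001} \x{3002}   、 。
    //   \x{ff0c} \x{ff01}   ， ！
    //   \x{ff1f} \x{ff1a}   ？ ：
    //   \x{ff1b}            ；
    $pattern = '/[^\x{4e00}-\x{9fa5}\x{3001}\x{3002}\x{ff0c}\x{ff01}\x{ff1f}\x{ff1a}\x{ff1b}]+/u';
    // preg_replace returns null on error (e.g. invalid UTF-8); fall back to ''.
    return preg_replace($pattern, '', $text) ?? '';
}

echo keep_chinese_with_punctuation('Hello, 你好，我是一个 PHP 开发者。');
// 你好，我是一个开发者。
```

The ASCII comma after "Hello" is still removed, because only the full-width marks are listed in the class.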