How does PHP's strnatcasecmp function perform in Chinese character sorting? What problems exist?

gitbox 2025-05-27

strnatcasecmp is a built-in PHP function that compares two strings case-insensitively using a "natural order" algorithm. Natural order means that runs of digits inside the strings are compared by their numeric value, the way a person would order them, rather than character by character according to their ASCII codes. For example:

$str1 = 'a10';
$str2 = 'a2';
echo strnatcasecmp($str1, $str2); // Outputs 1, because 'a10' sorts after 'a2' in natural order

The advantage of this function is that it correctly orders strings that contain numbers, whereas plain byte-wise comparison functions such as strcmp would place 'a10' before 'a2'.
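To make the difference concrete, here is a minimal sketch that sorts the same list twice, once with strcmp and once with strnatcasecmp. The file names are made-up sample data, not taken from any real project.

```php
<?php
// Hypothetical sample data: file names that mix letters, case, and numbers.
$files = ['img12.png', 'IMG10.png', 'img2.png', 'img1.png'];

// Byte-wise, case-sensitive comparison.
$byteOrder = $files;
usort($byteOrder, 'strcmp');

// Natural-order, case-insensitive comparison.
$naturalOrder = $files;
usort($naturalOrder, 'strnatcasecmp');

echo implode(', ', $byteOrder), PHP_EOL;    // IMG10.png, img1.png, img12.png, img2.png
echo implode(', ', $naturalOrder), PHP_EOL; // img1.png, img2.png, IMG10.png, img12.png
```

With strcmp the uppercase name comes first and 'img12' lands before 'img2'. With strnatcasecmp the numeric parts are compared as numbers and case is ignored.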

2. The sorting of Chinese characters

When dealing with English text, strnatcasecmp usually performs well. With Chinese characters, however, problems appear. strnatcasecmp knows nothing about the language or encoding of the strings it receives; it simply compares them byte by byte. Chinese characters are multi-byte characters, so strnatcasecmp cannot order them as sensibly as it orders English text.

2.1 The encoding difference of Chinese characters

The sorting problem is first of all tied to the encoding. PHP strings are plain byte sequences. The default character encoding is usually UTF-8, but the same Chinese text may also be stored as GB2312 or GBK, and strnatcasecmp compares whatever bytes it is given. As a result, the same characters can sort differently depending on how they are encoded.

For example:

$str1 = '苹果';
$str2 = '香蕉';
echo strnatcasecmp($str1, $str2); // Outputs a result based on byte order, which does not necessarily match the expected Chinese ordering

Even with UTF-8 encoding, this byte-level comparison does not give ideal results, because strnatcasecmp has no knowledge of the collation rules that apply to the characters it compares.
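The effect of the encoding can be made visible with a short sketch. It assumes the source file is saved as UTF-8 and that the mbstring extension is available for the re-encoding step; the three sample characters and the use of CP936 (the name mbstring uses for the GBK code page) are choices made for this illustration.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literals are UTF-8 byte strings.
$chars = ['中', '一', '啊']; // pinyin: zhong, yi, a

// Show the raw bytes that strnatcasecmp() actually compares.
foreach ($chars as $c) {
    printf("%s  UTF-8 bytes: %s\n", $c, bin2hex($c));
}

// Sorting the UTF-8 strings follows the byte values (Unicode code point order).
$utf8Sorted = $chars;
usort($utf8Sorted, 'strnatcasecmp');
echo implode(' ', $utf8Sorted), PHP_EOL; // 一 中 啊

// The same characters re-encoded as CP936 (GBK) have different bytes
// and therefore a different order.
if (function_exists('mb_convert_encoding')) {
    $gbk = array_map(fn($c) => mb_convert_encoding($c, 'CP936', 'UTF-8'), $chars);
    usort($gbk, 'strnatcasecmp');
    // Convert back to UTF-8 only for display.
    $gbkSorted = array_map(fn($c) => mb_convert_encoding($c, 'UTF-8', 'CP936'), $gbk);
    echo implode(' ', $gbkSorted), PHP_EOL; // 啊 一 中
}
```

In UTF-8 the character 啊 sorts last even though its pinyin starts with "a". In GBK the common characters happen to be laid out roughly by pinyin, so the same call yields a different order. Neither result comes from strnatcasecmp understanding Chinese; both are side effects of the byte layout.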

2.2 Processing of multi-byte characters

Another problem with strnatcasecmp is that it has no special handling for multi-byte characters. A Chinese character is made up of several bytes, and PHP's byte-oriented string functions (strnatcasecmp among them) do not take the collation rules of such characters into account. The encoded order of many Chinese characters does not match everyday sorting habits, so the sorted results deviate from what users expect.

3. Why does strnatcasecmp sort Chinese inaccurately?

strnatcasecmp does not take into account the linguistic properties of characters, but simply compares in byte order. For English characters, such comparison methods are usually valid, but for Chinese characters, byte sorting does not conform to actual language sorting rules. Specifically:

Byte order differs from linguistic order: in UTF-8, every byte of a Chinese character is greater than any ASCII byte, so Chinese text always sorts after English letters and digits, and the Chinese characters themselves fall into code point order rather than an order users expect.
The influence of multi-byte characters: a common Chinese character occupies several bytes (three in UTF-8, two in GBK), and strnatcasecmp compares those bytes one at a time without treating them as a single character.
Collation conventions: Chinese text is conventionally ordered by pinyin, stroke count, or radical rather than by an alphabet. strnatcasecmp only compares bytes and cannot reflect any of these orderings.
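The points above can be observed on a small mixed list. This is a sketch that assumes the source file is saved as UTF-8; the names are invented sample data.

```php
<?php
// Assumption: file saved as UTF-8. The entries are made-up sample data.
$names = ['张三', 'apple', '李四', 'Zebra'];

usort($names, 'strnatcasecmp');

echo implode(', ', $names), PHP_EOL; // apple, Zebra, 张三, 李四
```

The ASCII entries come first, ordered case-insensitively. The Chinese names follow in byte order, which puts 张三 (zhang san) ahead of 李四 (li si), while a pinyin ordering would put 李四 first.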

4. Solution

To sort Chinese characters, it is recommended to use a locale-aware comparison function, or to preprocess the strings before passing them to strnatcasecmp.

4.1 Using the collator_compare function

PHP provides the Collator class (part of the intl extension), which supports sorting rules based on language and region. For Chinese characters, sorting with Collator is a more appropriate choice. Here is an example that uses its procedural interface to compare two Chinese strings:

$collator = collator_create('zh_CN'); // Create a collator for the Chinese (China) locale
$str1 = '苹果';
$str2 = '香蕉';
echo collator_compare($collator, $str1, $str2); // Outputs the comparison result (-1, 0, or 1)

In this way, collator_compare compares the strings according to the collation rules of the Chinese locale, avoiding the problems strnatcasecmp has with Chinese characters.
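For sorting a whole array, the object-oriented interface of the same class can be used. The sketch below assumes the intl extension is loaded and that the underlying ICU data includes Chinese collation rules; the sample characters are chosen for illustration only.

```php
<?php
// Assumption: file saved as UTF-8; sample characters with pinyin zhong, yi, a.
$words = ['中', '一', '啊'];
$sorted = $words;
$collator = null;

if (class_exists('Collator')) {
    // Collator for the Chinese (China) locale.
    $collator = new Collator('zh_CN');
    // Collator::sort() sorts the array in place using the locale's rules.
    $collator->sort($sorted);
    echo implode(' ', $sorted), PHP_EOL;
} else {
    echo 'The intl extension is not loaded; Collator is unavailable.', PHP_EOL;
}
```

With typical ICU data the zh locale defaults to pinyin ordering, so the expected result is 啊 一 中, but the exact order depends on the ICU version and data installed with PHP.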

4.2 Using the mbstring extension

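If your PHP environment has the mbstring extension, you can use mb_strtolower or mb_strtoupper to normalize the case of the strings before comparing them. strnatcasecmp only folds the case of ASCII letters, so this step helps with strings that mix Chinese with multi-byte letters such as full-width Latin characters. It does not change the relative order of the Chinese characters themselves.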

$str1 = '苹果';
$str2 = '香蕉';
// Normalize case with a multi-byte-aware function, then compare in natural order
echo strnatcasecmp(mb_strtolower($str1, 'UTF-8'), mb_strtolower($str2, 'UTF-8'));

Although this approach cannot completely solve the problem of Chinese sorting, in some cases it can provide more reasonable sorting results.
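A case where the normalization step makes a visible difference is text containing full-width Latin letters, which are common in Chinese input. The following sketch assumes a UTF-8 source file and the mbstring extension; the sample strings are invented for illustration.

```php
<?php
// Assumption: file saved as UTF-8. The first character of each string is a
// FULL-WIDTH Latin letter (U+FF21 and U+FF41), not an ASCII 'A' or 'a'.
$a = 'Ａ苹果';
$b = 'ａ苹果';

// strnatcasecmp() folds ASCII case only, so these two strings differ.
$raw = strnatcasecmp($a, $b);
var_dump($raw !== 0); // bool(true)

$normalized = null;
if (function_exists('mb_strtolower')) {
    // mb_strtolower() applies Unicode case mapping, including full-width letters.
    $normalized = strnatcasecmp(
        mb_strtolower($a, 'UTF-8'),
        mb_strtolower($b, 'UTF-8')
    );
    var_dump($normalized === 0); // bool(true)
}
```

The Chinese characters pass through mb_strtolower unchanged, which is why this technique improves case handling but leaves the byte-order problem for Chinese characters in place.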

5. Summary

The strnatcasecmp function has clear limitations when dealing with Chinese characters: it considers neither the linguistic order of the characters nor the fact that they are multi-byte. For Chinese sorting, the Collator class is the more accurate and recommended approach. By adopting tools better suited to Chinese text, developers can avoid the problems strnatcasecmp runs into with Chinese characters and improve the stability and user experience of their programs.