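When you use strcmp() in PHP to compare strings that contain Chinese, emoji, or accented Latin letters, you will occasionally run into symptoms such as "the comparison result is wrong", "text shows up as question marks or small boxes", or "the sort order looks odd". Many people lump all of this together as "garbled text" (mojibake). In fact, strcmp() has no ability to scramble characters: it only compares the raw bytes of two strings. The problem usually lies in inconsistent character encodings, differing Unicode normalization forms, or the wrong comparison tool.
The sections below cover the key causes and the practical fixes.
Binary-safe and case-sensitive: strcmp($a, $b) compares $a and $b byte by byte and returns <0 / 0 / >0. It knows nothing about UTF-8, GBK, or emoji, has no notion of alphabetical collation rules, and performs no case folding or accent handling.
Conclusion: if two strings use different encodings, or are both UTF-8 but differ in their byte sequences (for example, one has a BOM or they use different normalization forms), the result of strcmp() will look unreasonable.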
```php
<?php
var_dump(strcmp("a", "b")); // int(-1): as expected, a < b
// Compares UTF-8 bytes (E6 BC A2 vs E5 AD 97 in a UTF-8 source file), so the
// result is > 0 even though 漢 (hàn) sorts before 字 (zì) in pinyin order
var_dump(strcmp("漢", "字"));
```
Encoding mismatch (UTF-8 vs. GBK, etc.)
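The same Chinese text has completely different bytes when one copy is UTF-8 and the other is GBK. strcmp() only looks at bytes, so the two copies never line up.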
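As a minimal sketch of the mismatch described above, the snippet below hard-codes the bytes of 中 in UTF-8 and in GBK so that it does not depend on the encoding of the source file. The variable names are illustrative, and the conversion step assumes the mbstring extension is loaded.

```php
<?php
// The same character, 中, as raw bytes in two encodings.
$utf8 = "\xE4\xB8\xAD"; // UTF-8: E4 B8 AD
$gbk  = "\xD6\xD0";     // GBK:   D6 D0

// strcmp() sees two unrelated byte strings, so they are not equal.
var_dump(strcmp($utf8, $gbk) === 0); // bool(false)

// Assumption: mbstring is available. Convert both sides to one encoding first.
if (function_exists('mb_convert_encoding')) {
    $gbkAsUtf8 = mb_convert_encoding($gbk, 'UTF-8', 'GBK');
    var_dump(strcmp($utf8, $gbkAsUtf8) === 0); // equal once both sides are UTF-8
}
```

Passing the known source encoding explicitly, as here, avoids relying on encoding detection.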
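UTF-8 BOM and invisible characters
A BOM (EF BB BF), a zero-width space (ZWSP), or invisible control characters mixed into the start of a file or into user input change the leading or trailing bytes, so comparisons give wildly wrong results.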
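One way to spot such invisible bytes is to dump both strings as hex. The following is a small sketch; `hex_bytes()` is a hypothetical helper written for this example, not a PHP built-in.

```php
<?php
// Hypothetical helper: show each byte as two hex digits so that
// invisible characters become visible.
function hex_bytes(string $s): string {
    return implode(' ', str_split(bin2hex($s), 2));
}

$fromFile = "\xEF\xBB\xBFadmin"; // "admin" preceded by a UTF-8 BOM
$expected = "admin";

var_dump(strcmp($fromFile, $expected) === 0); // bool(false)
echo hex_bytes($fromFile), PHP_EOL; // ef bb bf 61 64 6d 69 6e
echo hex_bytes($expected), PHP_EOL; // 61 64 6d 69 6e
```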
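Normalization differences (NFC/NFD)
é can be a single code point (NFC) or "e + combining accent" (NFD). The two look identical to the eye but have different bytes, so strcmp() reports them as unequal or sorts them oddly.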
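The difference can be reproduced with explicit code point escapes. This sketch assumes PHP 7 or later for the `\u{...}` syntax and, for the normalization step, that the intl extension is loaded.

```php
<?php
$nfc = "caf\u{00E9}";  // é as one code point, bytes C3 A9
$nfd = "cafe\u{0301}"; // e + combining acute accent, bytes 65 CC 81

var_dump(strcmp($nfc, $nfd) === 0); // bool(false)
echo bin2hex($nfc), ' vs ', bin2hex($nfd), PHP_EOL; // 636166c3a9 vs 63616665cc81

// Assumption: the intl extension is loaded.
if (class_exists('Normalizer')) {
    $same = Normalizer::normalize($nfc, Normalizer::FORM_C)
        === Normalizer::normalize($nfd, Normalizer::FORM_C);
    var_dump($same); // the two forms match after normalizing both to NFC
}
```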
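Expecting human/language-rule ordering but comparing bytes
If you want ordering by Chinese pinyin, German ß, French accents, or Japanese kana rules, strcmp() understands none of them; you need a locale-aware comparison tool.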
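A short sketch of what byte ordering does to a mixed word list. The word list is made up for illustration, and the second half assumes the intl extension is loaded with German collation data available in ICU.

```php
<?php
$words = ['Zebra', 'apple', "\u{00C4}pfel"]; // the last entry is "Äpfel"

// Byte-order sort: 'Z' (0x5A) < 'a' (0x61) < 'Ä' (0xC3 0x84 in UTF-8).
$byBytes = $words;
usort($byBytes, 'strcmp');
print_r($byBytes); // Zebra, apple, Äpfel

// Assumption: intl is loaded and ICU ships German collation data.
if (class_exists('Collator')) {
    $byLocale = $words;
    (new Collator('de_DE'))->sort($byLocale);
    print_r($byLocale); // ordered by German collation rules instead of raw bytes
}
```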
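Unify at every entry point: database connections, HTTP headers, template files, and the CLI environment should all be set to UTF-8, end to end.
Strip the BOM and control characters: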
```php
<?php
function strip_bom_and_controls(string $s): string {
    // Strip a leading UTF-8 BOM (EF BB BF)
    $s = preg_replace('/^\xEF\xBB\xBF/', '', $s) ?? $s;
    // Strip common invisible characters: ZWSP, ZWNJ, ZWJ, NBSP ...
    // Caution: ZWJ also joins emoji sequences, and NBSP is not zero-width.
    // With /u, preg_replace() returns null on invalid UTF-8, so fall back to the input.
    $s = preg_replace('/[\x{200B}\x{200C}\x{200D}\x{00A0}]/u', '', $s) ?? $s;
    return $s;
}
```
Convert when necessary:
```php
<?php
// The source encoding is guessed from the candidate list; detection is heuristic,
// so pass the real source encoding instead whenever it is known.
$clean = mb_convert_encoding($input, 'UTF-8', 'UTF-8,GBK,GB2312,BIG5,ISO-8859-1');
```
Install/enable the intl extension and use Normalizer to unify everything on NFC:
```php
<?php
if (class_exists('Normalizer')) {
    // normalize() returns false on failure (for example, invalid UTF-8)
    $a = Normalizer::normalize($a, Normalizer::FORM_C);
    $b = Normalizer::normalize($b, Normalizer::FORM_C);
}
```
If you only need byte-level equality: use strcmp(); for a case-insensitive byte comparison, use strcasecmp() (also byte-based, with ASCII rules).
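If you need human-readable, language-rule comparison (sorting, de-duplication, lookup): use the Collator class from the intl extension (locale-aware comparison that understands accents, case, and variants).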
```php
<?php
$coll = new \Collator('zh_CN'); // or 'zh-Hans-CN', 'en_US', 'de_DE', etc.
$coll->setStrength(\Collator::SECONDARY); // ignore case but still distinguish accents
var_dump($coll->compare('漢', '字')); // -1/0/1, based on the locale's collation rules
```
If you need counts, truncation, or iteration over user-perceived characters: use the grapheme_* functions (also from intl) to work on grapheme clusters, so that an emoji is not cut in half:
```php
<?php
// An emoji joined with ZWJ (woman + ZWJ + laptop), followed by two CJK characters
$text = "\u{1F469}\u{200D}\u{1F4BB}開發";
echo grapheme_substr($text, 0, 2); // the emoji followed by 開
```
If you need a case-insensitive multibyte comparison: apply mb_strtolower or mb_convert_case before comparing (remember to normalize first):
```php
<?php
// Normalize first so composed and decomposed forms fold to the same bytes
$a = Normalizer::normalize($a, Normalizer::FORM_C);
$b = Normalizer::normalize($b, Normalizer::FORM_C);
// Lower-case with multibyte awareness, then compare bytes
$aFold = mb_strtolower($a, 'UTF-8');
$bFold = mb_strtolower($b, 'UTF-8');
var_dump(strcmp($aFold, $bFold) === 0);
```
Example: sorting Chinese names by pinyin:
```php
<?php
$names = ['張三', '李四', '王五', '阿里', '曹操'];
$coll = new \Collator('zh_CN@collation=pinyin'); // requires pinyin collation support in the system ICU
$coll->sort($names);
print_r($names);
```
Example: de-duplicating while ignoring accents and case:
```php
<?php
$input = ['café', 'Cafe', 'CAFé', 'cafe'];
$coll = new \Collator('fr_FR');
$coll->setStrength(\Collator::PRIMARY); // ignore accents and case
$unique = [];
foreach ($input as $s) {
    $sN = Normalizer::normalize($s, Normalizer::FORM_C);
    $found = false;
    foreach ($unique as $u) {
        if ($coll->compare($sN, $u) === 0) { $found = true; break; }
    }
    if (!$found) {
        $unique[] = $sN;
    }
}
print_r($unique); // only one variant is kept
```
Confirm the encoding: mb_detect_encoding($s, ['UTF-8','GBK','BIG5','ISO-8859-1'], true). If in doubt, convert to UTF-8 first.
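Strip the BOM and zero-width characters: see strip_bom_and_controls() above.
Normalize to NFC: Normalizer::normalize().
Be explicit about the goal of the comparison:
Byte-level equality: strcmp/strcasecmp.
Language-rule ordering/equality: Collator (intl).
User-perceived character handling: grapheme_*.
Keep the database and HTTP headers consistent: in MySQL use utf8mb4 with a suitable collation (e.g. utf8mb4_0900_ai_ci), and send Content-Type: text/html; charset=UTF-8 over HTTP.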
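The cleaning and normalization steps of the checklist above could be combined roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: `prepare_for_compare()` is a hypothetical helper, it assumes the input is already UTF-8 (encoding detection is left out), and it deliberately keeps ZWJ so that emoji sequences survive, unlike strip_bom_and_controls().

```php
<?php
// Hypothetical helper: prepare a UTF-8 string for a byte-level comparison.
function prepare_for_compare(string $s): string {
    // 1. Strip a leading UTF-8 BOM.
    if (strncmp($s, "\xEF\xBB\xBF", 3) === 0) {
        $s = substr($s, 3);
    }
    // 2. Strip zero-width space and zero-width non-joiner (ZWJ is kept).
    //    preg_replace() returns null on invalid UTF-8, so fall back to the input.
    $s = preg_replace('/[\x{200B}\x{200C}]/u', '', $s) ?? $s;
    // 3. Normalize to NFC when intl is available.
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
        if ($normalized !== false) {
            $s = $normalized;
        }
    }
    return $s;
}

$a = "\xEF\xBB\xBFcaf\u{00E9}"; // BOM + NFC form
$b = "caf\u{200B}e\u{0301}";    // zero-width space + NFD form
// Equal only when intl is loaded, because step 3 is skipped otherwise.
var_dump(strcmp(prepare_for_compare($a), prepare_for_compare($b)) === 0);
```

Run both operands through the same helper before calling strcmp(); cleaning only one side reintroduces the mismatch.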
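"Both strings are UTF-8, so why does strcmp() still say they differ?"
One of them may contain a BOM or zero-width characters, or one is NFC and the other NFD. Clean and normalize both first.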
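"Can strcasecmp() do internationalized case-insensitive comparison?"
Its case folding is essentially ASCII-only. A more reliable approach is to compare after mb_strtolower(), or to use a Collator with a suitable strength.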
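A quick sketch of that limitation; the sample words are made up, and the second comparison assumes the mbstring extension is loaded.

```php
<?php
$upper = "\u{00C9}COLE"; // "ÉCOLE"
$lower = "\u{00E9}cole"; // "école"

// strcasecmp() folds only ASCII letters, so É (C3 89) and é (C3 A9) stay different.
var_dump(strcasecmp($upper, $lower) === 0); // bool(false)

// Assumption: mbstring is loaded.
if (function_exists('mb_strtolower')) {
    // Both sides fold to "école", so the byte comparison now succeeds.
    var_dump(mb_strtolower($upper, 'UTF-8') === mb_strtolower($lower, 'UTF-8'));
}
```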
"Emoji or combined characters get truncated or miscounted"
Use grapheme_strlen/grapheme_substr; do not use strlen/substr on user-visible characters.
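To illustrate the gap between bytes and user-perceived characters, here is a sketch using one ZWJ emoji sequence chosen for the example. The grapheme count depends on the ICU version behind intl, so it is printed rather than asserted.

```php
<?php
// One visible emoji: U+1F468 (man) + U+200D (ZWJ) + U+1F4BB (laptop).
$emoji = "\u{1F468}\u{200D}\u{1F4BB}";

echo strlen($emoji), PHP_EOL;                // 11 bytes (4 + 3 + 4)
echo bin2hex(substr($emoji, 0, 2)), PHP_EOL; // f09f, an incomplete UTF-8 sequence

// Assumption: intl is loaded; the cluster count depends on the ICU version.
if (function_exists('grapheme_strlen')) {
    echo grapheme_strlen($emoji), PHP_EOL; // typically 1 with a recent ICU
}
```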