
Does strcmp() Produce Garbled Text? The Real Causes and How to Avoid Character Encoding Problems

gitbox 2025-09-11

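When you use strcmp() in PHP to compare strings that contain Chinese, emoji, or accented Latin letters, you will occasionally see "wrong" comparison results, question marks or small boxes in the output, or odd sort orders. Many people lump all of this together as "garbled text" (mojibake). In fact, strcmp() has no ability to scramble characters: it only compares the raw bytes of two strings. The real problem usually lies in inconsistent character encodings, differing Unicode normalization forms, or using the wrong comparison tool.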

The sections below cover the key causes and the practical fixes.

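I. What does strcmp() actually do?

  • Binary-safe and case-sensitive: strcmp($a, $b) compares $a and $b byte by byte and returns <0 / 0 / >0. It knows nothing about UTF-8, GBK, or emoji, has no notion of alphabetical collation rules, and performs no case folding or accent handling.

  • Conclusion: if two strings use different encodings, or are both UTF-8 but have different byte sequences (for example a leading BOM or different normalization forms), the result of strcmp() will look unreasonable.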

```php
<?php
var_dump(strcmp("a", "b"));   // int(-1): as expected, a < b
var_dump(strcmp("漢", "字")); // compares the UTF-8 byte sequences; the result need not match pinyin order
```
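
To make the byte-level behaviour concrete, the sketch below prints the UTF-8 bytes of the two characters from the example above. It assumes UTF-8 string data; the `\u{...}` escapes (PHP 7.0+) are used so the result does not depend on how the source file is saved.

```php
<?php
// "漢" is U+6F22 and "字" is U+5B57; \u{...} always yields UTF-8 bytes.
$han = "\u{6F22}";
$zi  = "\u{5B57}";

echo bin2hex($han), "\n"; // e6bca2
echo bin2hex($zi), "\n";  // e5ad97

// strcmp() only sees those bytes: 0xE6 > 0xE5, so "漢" sorts after "字",
// although in pinyin order "han" comes before "zi".
$result = strcmp($han, $zi);
var_dump($result > 0); // bool(true)
```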

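II. Four common root causes of "garbled text" and comparison anomalies

  1. Inconsistent encodings (UTF-8 vs GBK, etc.)
    The same Chinese text has completely different bytes in UTF-8 and in GBK. strcmp() only looks at bytes, so the two never line up.

  2. UTF-8 BOM and invisible characters
    A BOM (EF BB BF), a zero-width space (ZWSP), or invisible control characters that sneak into a file header or user input change the leading or trailing bytes and throw the comparison off.

  3. Normalization differences (NFC/NFD)
    é can be a single code point (NFC) or "e + combining accent" (NFD). They look identical to the eye but differ in bytes, so strcmp() reports them as unequal or sorts them oddly.

  4. Expecting human or language-specific ordering from a byte comparison
    If you want to sort by Chinese pinyin, German ß, French accents, or Japanese kana rules, strcmp() knows none of them; you need a locale-aware comparison tool.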
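
The first three causes can be reproduced with hand-written byte sequences and no extensions. This is a minimal sketch: the byte values are the standard encodings of the characters named in the comments, and the variable names are only illustrative.

```php
<?php
// 1. Same text, different encodings: "中" as UTF-8 vs as GBK.
$utf8 = "\xE4\xB8\xAD";
$gbk  = "\xD6\xD0";
var_dump(strcmp($utf8, $gbk) === 0); // bool(false)

// 2. A UTF-8 BOM in front of otherwise identical text.
$withBom = "\xEF\xBB\xBF" . "abc";
var_dump(strcmp($withBom, "abc") === 0); // bool(false)

// 3. "é" precomposed (NFC, U+00E9) vs decomposed (NFD, "e" + U+0301).
$nfc = "\u{00E9}";
$nfd = "e\u{0301}";
var_dump(strcmp($nfc, $nfd) === 0); // bool(false)
echo bin2hex($nfc), ' vs ', bin2hex($nfd), "\n"; // c3a9 vs 65cc81
```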

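III. How to avoid it: a practical set of countermeasures

1) Standardize on UTF-8 (without BOM)

  • Consistent at every entry point: set the database connection, HTTP headers, template files, and CLI environment to UTF-8 end to end.

  • Strip the BOM and invisible characters: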

```php
<?php
function strip_bom_and_controls(string $s): string {
    // Strip a leading UTF-8 BOM
    $s = preg_replace('/^\xEF\xBB\xBF/', '', $s);
    // Strip common invisible characters: ZWSP, ZWNJ, ZWJ, NBSP.
    // Note: removing ZWJ also splits emoji ZWJ sequences.
    // With /u, preg_replace() returns null on invalid UTF-8, so fall back to the input.
    $s = preg_replace('/[\x{200B}\x{200C}\x{200D}\x{00A0}]/u', '', $s) ?? $s;
    return $s;
}
```
  • Convert when necessary:

```php
<?php
// The third argument lists candidate source encodings; detection is heuristic and order-sensitive
$clean = mb_convert_encoding($input, 'UTF-8', 'UTF-8,GBK,GB2312,BIG5,ISO-8859-1');
```
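
Because detection from a candidate list is only a heuristic, one option is to validate first and convert only when the input is not already valid UTF-8, using a source encoding you know from context. The sketch below is one way to do that. The helper name `to_utf8()` and its default fallback `'GBK'` are assumptions for illustration, and the conversion path requires the mbstring extension.

```php
<?php
/**
 * Return $s as UTF-8. Hypothetical helper: $assumedSource is the encoding
 * you believe legacy input uses (an assumption, not something PHP can prove).
 */
function to_utf8(string $s, string $assumedSource = 'GBK'): string {
    // preg_match() with the /u flag fails on invalid UTF-8, so it doubles
    // as a validity check that needs no extension.
    if (preg_match('//u', $s) === 1) {
        return $s; // already valid UTF-8, leave untouched
    }
    if (!function_exists('mb_convert_encoding')) {
        throw new RuntimeException('mbstring is required to convert encodings');
    }
    return mb_convert_encoding($s, 'UTF-8', $assumedSource);
}

var_dump(to_utf8("abc") === "abc"); // bool(true): valid input passes through

if (function_exists('mb_convert_encoding')) {
    // The GBK bytes for "中" (D6 D0) are not valid UTF-8, so they get converted.
    echo bin2hex(to_utf8("\xD6\xD0")), "\n"; // e4b8ad
}
```

Validating before converting avoids double-encoding text that was already UTF-8, which is a common source of mojibake in its own right.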

2) Apply Unicode normalization

  • Install or enable the intl extension and use Normalizer to bring everything to NFC:

```php
<?php
if (class_exists('Normalizer')) {
    // normalize() returns false on invalid input, so check the result in production code
    $a = Normalizer::normalize($a, Normalizer::FORM_C);
    $b = Normalizer::normalize($b, Normalizer::FORM_C);
}
```
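
As a rough illustration of what normalization buys you, the sketch below wraps the snippet above in a small equality helper. The name `nfc_equals()` is hypothetical, and the fallback to a raw byte comparison when intl is missing is a design assumption rather than a recommendation.

```php
<?php
/** Hypothetical helper: byte equality after NFC normalization. */
function nfc_equals(string $a, string $b): bool {
    if (!class_exists('Normalizer')) {
        // Without intl we can only fall back to a raw byte comparison.
        return strcmp($a, $b) === 0;
    }
    $na = Normalizer::normalize($a, Normalizer::FORM_C);
    $nb = Normalizer::normalize($b, Normalizer::FORM_C);
    if ($na === false || $nb === false) {
        return false; // invalid UTF-8 cannot be normalized
    }
    return strcmp($na, $nb) === 0;
}

$nfc = "caf\u{00E9}";   // precomposed é
$nfd = "cafe\u{0301}";  // e + combining acute accent

var_dump(strcmp($nfc, $nfd) === 0); // bool(false): the bytes differ
var_dump(nfc_equals($nfc, $nfd));   // bool(true) when intl is available
```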

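3) Choose the right comparison function

  • You only need byte-level equality: use strcmp(); for a case-insensitive byte comparison, use strcasecmp() (still byte-based, with ASCII case rules).

  • You need human-readable, language-aware comparison (sorting, deduplication, lookup): use the Collator class from the intl extension (locale-aware comparison that understands accents, case, and variants).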

```php
<?php
$coll = new \Collator('zh_CN');           // or 'zh-Hans-CN', 'en_US', 'de_DE', etc.
$coll->setStrength(\Collator::SECONDARY); // ignores case but still distinguishes accents
var_dump($coll->compare('漢', '字'));     // -1/0/1, based on linguistic rules
```
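  • You need the number of characters as the user sees them, or safe truncation and iteration: use the grapheme_* functions (also part of intl) to work on grapheme clusters, so that an emoji is never cut in half: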

```php
<?php
$text = "\u{1F468}\u{200D}\u{1F4BB}開發"; // 👨‍💻: an emoji containing a ZWJ joiner
echo grapheme_substr($text, 0, 2);        // the whole emoji followed by 開
```

```php
<?php
// Case-insensitive equality for Unicode text: normalize to NFC, lowercase, then compare bytes
$a = Normalizer::normalize($a, Normalizer::FORM_C);
$b = Normalizer::normalize($b, Normalizer::FORM_C);
$aFold = mb_strtolower($a, 'UTF-8');
$bFold = mb_strtolower($b, 'UTF-8');
var_dump(strcmp($aFold, $bFold) === 0);
```
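
The difference between byte order and collation order shows up even with plain ASCII. The following sketch contrasts the two; the `en_US` locale is just an example choice, and the Collator part only runs when intl is loaded.

```php
<?php
// Byte order: uppercase ASCII (0x41-0x5A) sorts before lowercase (0x61-0x7A).
var_dump(strcmp('a', 'B') > 0); // bool(true): 0x61 > 0x42

if (class_exists('Collator')) {
    $coll = new \Collator('en_US');
    // Dictionary order: "a" comes before "B" regardless of case.
    var_dump($coll->compare('a', 'B') < 0); // bool(true)

    // At SECONDARY strength, case differences are ignored entirely.
    $coll->setStrength(\Collator::SECONDARY);
    var_dump($coll->compare('apple', 'APPLE') === 0); // bool(true)
}
```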

4) Sorting and deduplication in practice

Sorting by Chinese pinyin (illustrative)

```php
<?php
$names = ['張三', '李四', '王五', '阿里', '曹操'];
$coll = new \Collator('zh_CN@collation=pinyin'); // requires pinyin collation support in the system ICU
$coll->sort($names);
print_r($names);
```
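
For contrast, here is what a plain byte-wise sort does to the same list. This sketch needs no extension and assumes the source file is saved as UTF-8; the code points in the comment explain the resulting order.

```php
<?php
$names = ['張三', '李四', '王五', '阿里', '曹操'];

// Plain byte-wise sort. For valid UTF-8, byte order equals code point order:
// 張 U+5F35 < 曹 U+66F9 < 李 U+674E < 王 U+738B < 阿 U+963F
usort($names, 'strcmp');

echo implode(', ', $names), "\n"; // 張三, 曹操, 李四, 王五, 阿里
```

The result is stable and reproducible, but it follows the code chart rather than pronunciation, which is why a Collator is needed for pinyin order.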

Human-style deduplication that ignores case and accents

```php
<?php
$input = ['café', 'Cafe', 'CAFé', 'cafe'];
$coll = new \Collator('fr_FR');
$coll->setStrength(\Collator::PRIMARY); // ignores accents and case
$unique = [];
foreach ($input as $s) {
    $sN = Normalizer::normalize($s, Normalizer::FORM_C);
    $found = false;
    foreach ($unique as $u) {
        if ($coll->compare($sN, $u) === 0) { $found = true; break; }
    }
    if (!$found) $unique[] = $sN;
}
print_r($unique); // only one variant is kept
```
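
The nested loop above compares every new string against every kept one. For larger lists, a possible alternative is to index by the collator's sort key, since strings that compare equal under a given Collator produce identical keys. This is a sketch under that assumption; `collator_unique()` is a hypothetical helper name, and it requires the intl extension.

```php
<?php
/**
 * Hypothetical helper: keep the first string of each group that the
 * collator considers equal.
 *
 * @param string[] $items
 * @return string[]
 */
function collator_unique(array $items, \Collator $coll): array {
    $seen = [];
    foreach ($items as $s) {
        // Strings that compare equal under $coll share the same sort key.
        $key = $coll->getSortKey($s);
        if ($key === false) {
            continue; // skip strings ICU could not process
        }
        $hex = bin2hex($key); // sort keys are binary, so hex-encode them for use as array keys
        if (!isset($seen[$hex])) {
            $seen[$hex] = $s;
        }
    }
    return array_values($seen);
}

if (class_exists('Collator')) {
    $coll = new \Collator('fr_FR');
    $coll->setStrength(\Collator::PRIMARY); // ignore accents and case
    print_r(collator_unique(['café', 'Cafe', 'CAFé', 'cafe'], $coll)); // keeps only 'café'
}
```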

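IV. Quick troubleshooting checklist (run through it whenever you hit "garbled text")

  1. Confirm the encoding: mb_detect_encoding($s, ['UTF-8','GBK','BIG5','ISO-8859-1'], true). If in doubt, convert to UTF-8 first.

  2. Strip the BOM and zero-width characters: see strip_bom_and_controls() above.

  3. Normalize to NFC: Normalizer::normalize().

  4. Be clear about what you are comparing:

    • Byte-level equality: strcmp / strcasecmp

    • Language-aware sorting and equality: Collator (intl)

    • Processing at the level of visible characters: grapheme_*

  5. Keep the database and HTTP headers consistent: in MySQL use utf8mb4 with a suitable collation (such as utf8mb4_0900_ai_ci), and send the HTTP header Content-Type: text/html; charset=UTF-8.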
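
When working through the checklist, it often helps to look at the actual bytes first. The helper below is a hypothetical diagnostic sketch (the name `diagnose_string()` and the keys of its result are made up for illustration); it uses only core functions.

```php
<?php
/**
 * Hypothetical diagnostic helper: summarize the byte-level facts about a
 * string that most often explain a surprising strcmp() result.
 */
function diagnose_string(string $s): array {
    $validUtf8 = preg_match('//u', $s) === 1; // the /u flag rejects invalid UTF-8
    return [
        'hex'            => bin2hex($s),
        'has_bom'        => strncmp($s, "\xEF\xBB\xBF", 3) === 0,
        'valid_utf8'     => $validUtf8,
        // Only look for zero-width code points when the string is valid UTF-8.
        'has_zero_width' => $validUtf8
            && preg_match('/[\x{200B}\x{200C}\x{200D}]/u', $s) === 1,
    ];
}

print_r(diagnose_string("\xEF\xBB\xBFabc")); // has_bom is true
print_r(diagnose_string("ab\u{200B}c"));     // has_zero_width is true
```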

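V. FAQ: a few typical pitfalls

  • "Both strings are UTF-8, so why does strcmp() still say they differ?"
    A BOM or zero-width character may have slipped in, or one string is NFC and the other NFD. Clean and normalize first.

  • "Can strcasecmp() do internationalized case-insensitive comparison?"
    Its case folding is essentially ASCII-only. A more reliable approach is to compare after mb_strtolower(), or to use a Collator with a suitable strength.

  • "Emoji or combined characters get truncated or miscounted"
    Use grapheme_strlen / grapheme_substr; do not use strlen / substr on user-visible characters.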
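
The last two pitfalls are easy to reproduce. The sketch below uses escaped code points so it does not depend on the file encoding; the mbstring and intl calls are guarded because those extensions may not be loaded.

```php
<?php
$upper = "\u{00C9}"; // É
$lower = "\u{00E9}"; // é

// strcasecmp() works byte by byte and cannot fold multibyte letters,
// so É and é stay different.
var_dump(strcasecmp($upper, $lower) === 0); // bool(false)

if (function_exists('mb_strtolower')) {
    // mb_strtolower() applies Unicode case mapping before the byte comparison.
    var_dump(strcmp(mb_strtolower($upper, 'UTF-8'), $lower) === 0); // bool(true)
}

// One user-visible character, three different "lengths".
$e = "e\u{0301}"; // e + combining acute accent
echo strlen($e), "\n"; // 3 (bytes)
if (function_exists('mb_strlen')) {
    echo mb_strlen($e, 'UTF-8'), "\n"; // 2 (code points)
}
if (function_exists('grapheme_strlen')) {
    echo grapheme_strlen($e), "\n"; // 1 (grapheme cluster)
}
```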