当前位置: 首页> 最新文章列表> strcmp 函数会出现乱码?揭秘原因及如何避免字符编码问题

strcmp 函数会出现乱码?揭秘原因及如何避免字符编码问题

gitbox 2025-09-11

当你在 PHP 里用 strcmp() 比较包含中文、表情或带音调的拉丁字母时,时不时会遇到“比较结果不对”“显示成问号/小方块”“排序奇怪”等现象。很多人把这统称为“乱码”。其实,strcmp() 本身没有“会把字符弄乱”的能力——它只是比较两个字符串的“原始二进制”。问题往往出在字符编码不一致文本规范化不同用错了比较工具上。

下面把关键原因与可操作的解决方案一次讲清。

一、strcmp() 到底做了什么?

  • 二进制安全、区分大小写strcmp($a, $b) 按字节序比较 $a$b,返回 <0 / 0 / >0。它不了解 UTF-8、GBK、emoji,也不懂“字母排序规则”,更不会做大小写折叠或重音处理。

  • 结论:如果两个字符串使用不同编码、或同是 UTF-8 但字节序列不同(例如带 BOM、不同规范化形式),strcmp() 的结果就会“看起来不合理”。

<span><span><span class="hljs-meta">&lt;?php</span></span><span>
</span><span><span class="hljs-title function_ invoke__">var_dump</span></span><span>(</span><span><span class="hljs-title function_ invoke__">strcmp</span></span><span>(</span><span><span class="hljs-string">"a"</span></span><span>, </span><span><span class="hljs-string">"b"</span></span><span>));   </span><span><span class="hljs-comment">// int(-1) 正常:a &lt; b</span></span><span>
</span><span><span class="hljs-title function_ invoke__">var_dump</span></span><span>(</span><span><span class="hljs-title function_ invoke__">strcmp</span></span><span>(</span><span><span class="hljs-string">"汉"</span></span><span>, </span><span><span class="hljs-string">"字"</span></span><span>)); </span><span><span class="hljs-comment">// 比较的是 UTF-8 的字节序,结果未必符合“汉字拼音顺序”</span></span><span>
</span></span>

二、常见“乱码/比较异常”的四大根因

  1. 编码不一致(UTF-8 vs GBK 等)
    同样是“中文”,一个是 UTF-8,一个是 GBK,字节完全不同。strcmp() 只看字节,当然“错位”。

  2. UTF-8 BOM 与不可见字符
    文件头或输入里混入 BOM (EF BB BF)、零宽空格(ZWSP)、不可见控制字符,会让首字节/尾字节不同,导致比较“离谱”。

  3. 规范化差异(NFC/NFD)
    é 可以是单一字符(NFC)或“e + 组合重音”(NFD)。人眼相同,字节不同,strcmp() 判不相等或排序异常。

  4. 期望“人类排序/语言规则”,却用字节比较
    想按中文拼音、德语 ?、法语重音、日文假名规则排序?strcmp() 不懂这些,需要区域化比较工具。

三、如何避免:一套可落地的对策

1) 统一为 UTF-8(无 BOM)

  • 入口统一:数据库连接、HTTP 头、模板文件、CLI 环境,全链路设为 UTF-8

  • 移除 BOM/控制字符

<span><span><span class="hljs-meta">&lt;?php</span></span><span>
</span><span><span class="hljs-function"><span class="hljs-keyword">function</span></span></span><span> </span><span><span class="hljs-title">strip_bom_and_controls</span></span><span>(</span><span><span class="hljs-params"><span class="hljs-keyword">string</span></span></span><span> </span><span><span class="hljs-variable">$s</span></span><span>): </span><span><span class="hljs-title">string</span></span><span> {
    </span><span><span class="hljs-comment">// 去 BOM</span></span><span>
    </span><span><span class="hljs-variable">$s</span></span><span> = </span><span><span class="hljs-title function_ invoke__">preg_replace</span></span><span>(</span><span><span class="hljs-string">'/^\xEF\xBB\xBF/'</span></span><span>, </span><span><span class="hljs-string">''</span></span><span>, </span><span><span class="hljs-variable">$s</span></span><span>);
    </span><span><span class="hljs-comment">// 去常见零宽字符:ZWSP, ZWNJ, ZWJ, NBSP …</span></span><span>
    </span><span><span class="hljs-variable">$s</span></span><span> = </span><span><span class="hljs-title function_ invoke__">preg_replace</span></span><span>(</span><span><span class="hljs-string">'/[\x{200B}\x{200C}\x{200D}\x{00A0}]/u'</span></span><span>, </span><span><span class="hljs-string">''</span></span><span>, </span><span><span class="hljs-variable">$s</span></span><span>);
    </span><span><span class="hljs-keyword">return</span></span><span> </span><span><span class="hljs-variable">$s</span></span><span>;
}
</span></span>
  • 必要时转换

<span><span><span class="hljs-meta">&lt;?php</span></span><span>
</span><span><span class="hljs-variable">$clean</span></span><span> = </span><span><span class="hljs-title function_ invoke__">mb_convert_encoding</span></span><span>(</span><span><span class="hljs-variable">$input</span></span><span>, </span><span><span class="hljs-string">'UTF-8'</span></span><span>, </span><span><span class="hljs-string">'UTF-8,GBK,GB2312,BIG5,ISO-8859-1'</span></span><span>);
</span></span>

2) 进行 Unicode 规范化

  • 安装/启用 intl 扩展,使用 Normalizer 统一到 NFC:

<span><span><span class="hljs-meta">&lt;?php</span></span><span>
</span><span><span class="hljs-keyword">if</span></span><span> (</span><span><span class="hljs-title function_ invoke__">class_exists</span></span><span>(</span><span><span class="hljs-string">'Normalizer'</span></span><span>)) {
    </span><span><span class="hljs-variable">$a</span></span><span> = </span><span><span class="hljs-title class_">Normalizer</span></span><span>::</span><span><span class="hljs-title function_ invoke__">normalize</span></span><span>(</span><span><span class="hljs-variable">$a</span></span><span>, </span><span><span class="hljs-title class_">Normalizer</span></span><span>::</span><span><span class="hljs-variable constant_">FORM_C</span></span><span>);
    </span><span><span class="hljs-variable">$b</span></span><span> = </span><span><span class="hljs-title class_">Normalizer</span></span><span>::</span><span><span class="hljs-title function_ invoke__">normalize</span></span><span>(</span><span><span class="hljs-variable">$b</span></span><span>, </span><span><span class="hljs-title class_">Normalizer</span></span><span>::</span><span><span class="hljs-variable constant_">FORM_C</span></span><span>);
}
</span></span>

3) 选择“正确的比较函数”

  • 仍然只需“字节一致性”:用 strcmp();或想不区分大小写的字节比较,用 strcasecmp()(同样按字节、ASCII 规则)。

  • 需要人类可读/语言规则的比较(排序/去重/查找):用 Intl\Collator(区域化比较,懂重音、大小写、变体)。

<span><span><span class="hljs-meta">&lt;?php</span></span><span>
</span><span><span class="hljs-variable">$coll</span></span><span> = </span><span><span class="hljs-keyword">new</span></span><span> </span><span><span class="hljs-title class_">\Collator</span></span><span>(</span><span><span class="hljs-string">'zh_CN'</span></span><span>);          </span><span><span class="hljs-comment">// 或 'zh-Hans-CN', 'en_US', 'de_DE' 等</span></span><span>
</span><span><span class="hljs-variable">$coll</span></span><span>-&gt;</span><span><span class="hljs-title function_ invoke__">setStrength</span></span><span>(</span><span><span class="hljs-title class_">\Collator</span></span><span>::</span><span><span class="hljs-variable constant_">SECONDARY</span></span><span>); </span><span><span class="hljs-comment">// 忽略大小写但区分重音等</span></span><span>
</span><span><span class="hljs-title function_ invoke__">var_dump</span></span><span>(</span><span><span class="hljs-variable">$coll</span></span><span>-&gt;</span><span><span class="hljs-title function_ invoke__">compare</span></span><span>(</span><span><span class="hljs-string">'汉'</span></span><span>, </span><span><span class="hljs-string">'字'</span></span><span>));     </span><span><span class="hljs-comment">// -1/0/1,基于语言学规则</span></span><span>
</span></span>
  • 需要“用户视觉上的字符数量/截断/遍历”:用 grapheme_* 系列(同 intl)处理“字素簇”,避免把一个 emoji 切成半个:

<span><span><span class="hljs-meta">&lt;?php</span></span><span>
</span><span><span class="hljs-variable">$text</span></span><span> = </span><span><span class="hljs-string">"?????开发"</span></span><span>; </span><span><span class="hljs-comment">// 含 ZWJ 连接符的 emoji</span></span><span>
</span><span><span class="hljs-keyword">echo</span></span><span> </span><span><span class="hljs-title function_ invoke__">grapheme_substr</span></span><span>(</span><span><span class="hljs-variable">$text</span></span><span>, </span><span><span class="hljs-number">0</span></span><span>, </span><span><span class="hljs-number">2</span></span><span>); </span><span><span class="hljs-comment">// ?????开</span></span><span>
</span></span>
<span><span><span class="hljs-meta">&lt;?php</span></span><span>
</span><span><span class="hljs-variable">$a</span></span><span> = </span><span><span class="hljs-title class_">Normalizer</span></span><span>::</span><span><span class="hljs-title function_ invoke__">normalize</span></span><span>(</span><span><span class="hljs-variable">$a</span></span><span>, </span><span><span class="hljs-title class_">Normalizer</span></span><span>::</span><span><span class="hljs-variable constant_">FORM_C</span></span><span>);
</span><span><span class="hljs-variable">$b</span></span><span> = </span><span><span class="hljs-title class_">Normalizer</span></span><span>::</span><span><span class="hljs-title function_ invoke__">normalize</span></span><span>(</span><span><span class="hljs-variable">$b</span></span><span>, </span><span><span class="hljs-title class_">Normalizer</span></span><span>::</span><span><span class="hljs-variable constant_">FORM_C</span></span><span>);
</span><span><span class="hljs-variable">$aFold</span></span><span> = </span><span><span class="hljs-title function_ invoke__">mb_strtolower</span></span><span>(</span><span><span class="hljs-variable">$a</span></span><span>, </span><span><span class="hljs-string">'UTF-8'</span></span><span>);
</span><span><span class="hljs-variable">$bFold</span></span><span> = </span><span><span class="hljs-title function_ invoke__">mb_strtolower</span></span><span>(</span><span><span class="hljs-variable">$b</span></span><span>, </span><span><span class="hljs-string">'UTF-8'</span></span><span>);
</span><span><span class="hljs-title function_ invoke__">var_dump</span></span><span>(</span><span><span class="hljs-title function_ invoke__">strcmp</span></span><span>(</span><span><span class="hljs-variable">$aFold</span></span><span>, </span><span><span class="hljs-variable">$bFold</span></span><span>) === </span><span><span class="hljs-number">0</span></span><span>);
</span></span>

4) 排序与查重的示例实践

按中文拼音排序(示意)

<span><span><span class="hljs-meta">&lt;?php</span></span><span>
</span><span><span class="hljs-variable">$names</span></span><span> = [</span><span><span class="hljs-string">'张三'</span></span><span>, </span><span><span class="hljs-string">'李四'</span></span><span>, </span><span><span class="hljs-string">'王五'</span></span><span>, </span><span><span class="hljs-string">'阿里'</span></span><span>, </span><span><span class="hljs-string">'曹操'</span></span><span>];
</span><span><span class="hljs-variable">$coll</span></span><span> = </span><span><span class="hljs-keyword">new</span></span><span> </span><span><span class="hljs-title class_">\Collator</span></span><span>(</span><span><span class="hljs-string">'zh_CN@collation=pinyin'</span></span><span>); </span><span><span class="hljs-comment">// 要求系统 ICU 支持</span></span><span>
</span><span><span class="hljs-variable">$coll</span></span><span>-&gt;</span><span><span class="hljs-title function_ invoke__">sort</span></span><span>(</span><span><span class="hljs-variable">$names</span></span><span>);
</span><span><span class="hljs-title function_ invoke__">print_r</span></span><span>(</span><span><span class="hljs-variable">$names</span></span><span>);
</span></span>

忽略大小写与重音的“人类去重”

<span><span><span class="hljs-meta">&lt;?php</span></span><span>
</span><span><span class="hljs-variable">$input</span></span><span> = [</span><span><span class="hljs-string">'café'</span></span><span>, </span><span><span class="hljs-string">'Cafe'</span></span><span>, </span><span><span class="hljs-string">'CAFé'</span></span><span>, </span><span><span class="hljs-string">'cafe'</span></span><span>];
</span><span><span class="hljs-variable">$coll</span></span><span> = </span><span><span class="hljs-keyword">new</span></span><span> </span><span><span class="hljs-title class_">\Collator</span></span><span>(</span><span><span class="hljs-string">'fr_FR'</span></span><span>);
</span><span><span class="hljs-variable">$coll</span></span><span>-&gt;</span><span><span class="hljs-title function_ invoke__">setStrength</span></span><span>(</span><span><span class="hljs-title class_">\Collator</span></span><span>::</span><span><span class="hljs-variable constant_">PRIMARY</span></span><span>); </span><span><span class="hljs-comment">// 忽略重音与大小写</span></span><span>
</span><span><span class="hljs-variable">$unique</span></span><span> = [];
</span><span><span class="hljs-keyword">foreach</span></span><span> (</span><span><span class="hljs-variable">$input</span></span><span> </span><span><span class="hljs-keyword">as</span></span><span> </span><span><span class="hljs-variable">$s</span></span><span>) {
    </span><span><span class="hljs-variable">$sN</span></span><span> = </span><span><span class="hljs-title class_">Normalizer</span></span><span>::</span><span><span class="hljs-title function_ invoke__">normalize</span></span><span>(</span><span><span class="hljs-variable">$s</span></span><span>, </span><span><span class="hljs-title class_">Normalizer</span></span><span>::</span><span><span class="hljs-variable constant_">FORM_C</span></span><span>);
    </span><span><span class="hljs-variable">$found</span></span><span> = </span><span><span class="hljs-literal">false</span></span><span>;
    </span><span><span class="hljs-keyword">foreach</span></span><span> (</span><span><span class="hljs-variable">$unique</span></span><span> </span><span><span class="hljs-keyword">as</span></span><span> </span><span><span class="hljs-variable">$u</span></span><span>) {
        </span><span><span class="hljs-keyword">if</span></span><span> (</span><span><span class="hljs-variable">$coll</span></span><span>-&gt;</span><span><span class="hljs-title function_ invoke__">compare</span></span><span>(</span><span><span class="hljs-variable">$sN</span></span><span>, </span><span><span class="hljs-variable">$u</span></span><span>) === </span><span><span class="hljs-number">0</span></span><span>) { </span><span><span class="hljs-variable">$found</span></span><span> = </span><span><span class="hljs-literal">true</span></span><span>; </span><span><span class="hljs-keyword">break</span></span><span>; }
    }
    </span><span><span class="hljs-keyword">if</span></span><span> (!</span><span><span class="hljs-variable">$found</span></span><span>) </span><span><span class="hljs-variable">$unique</span></span><span>[] = </span><span><span class="hljs-variable">$sN</span></span><span>;
}
</span><span><span class="hljs-title function_ invoke__">print_r</span></span><span>(</span><span><span class="hljs-variable">$unique</span></span><span>); </span><span><span class="hljs-comment">// 只保留一个变体</span></span><span>
</span></span>

四、快速排错清单(遇到“乱码”先过一遍)

  1. 确认编码mb_detect_encoding($s, ['UTF-8','GBK','BIG5','ISO-8859-1'], true)。若不确定,先转 UTF-8。

  2. 去 BOM/零宽字符:见上 strip_bom_and_controls()

  3. 规范化到 NFCNormalizer::normalize()

  4. 明确比较目标

    • 字节一致性:strcmp/strcasecmp

    • 语言规则排序/相等:Intl\Collator

    • 视觉字符级处理:grapheme_*

  5. 数据库与 HTTP 头一致:MySQL 用 utf8mb4 与合适的 collation(如 utf8mb4_0900_ai_ci),HTTP 设 Content-Type: text/html; charset=UTF-8

五、FAQ:几个典型“坑”

  • “同样是 UTF-8,为什么 strcmp() 还不相等?”
    可能混入了 BOM、零宽字符,或一个是 NFC、一个是 NFD。先清洗 + 规范化。

  • strcasecmp() 能做国际化不区分大小写吗?”
    它的折叠主要是 ASCII 语义。更可靠做法:mb_strtolower() 后再比较,或用 Collator 的合适强度。

  • “emoji/合字被截断或计数异常”
    使用 grapheme_strlen/grapheme_substr,不要用 strlen/substr 处理用户可见字符。