Does strcmp() produce garbled text? The real causes and how to avoid character-encoding problems

gitbox 2025-09-11

When you use strcmp() in PHP to compare strings that contain Chinese, emoji, or accented Latin letters, you may run into symptoms such as "the comparison result is wrong", "the output shows question marks or small boxes", and "the sort order looks strange". Many people lump all of this together as "garbled text". In fact, strcmp() itself cannot corrupt characters: it simply compares the raw bytes of two strings. The problem usually lies in inconsistent character encodings, differing text normalization, or the wrong comparison tool.

The sections below cover the main causes and practical fixes.

1. What exactly does strcmp() do?

  • Binary comparison, case-sensitive: strcmp($a, $b) compares $a and $b byte by byte and returns a value < 0, 0, or > 0. It has no notion of UTF-8, GBK, or emoji, knows nothing about alphabetical collation rules, and does no case folding or accent handling.

  • Conclusion: if two strings use different encodings, or are both UTF-8 but carry different byte sequences (for example a BOM or different normalization forms), the result of strcmp() will look "unreasonable".

```php
<?php
var_dump(strcmp("a", "b"));     // negative, e.g. int(-1): "a" < "b"
var_dump(strcmp("中文", "字符")); // compares UTF-8 byte order; the result may not match the pinyin order of the Chinese characters
```
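To make the byte-by-byte behaviour concrete, here is a minimal sketch that uses only core PHP. The sample words are assumptions chosen for illustration, and the checks look only at the sign of the result, since that is all strcmp() promises.

```php
<?php
// Illustrative words (assumed for this sketch); "\u{00E9}" is é as a single code point.
$words = ['apple', 'Zebra', "\u{00E9}tude"];

// usort() with strcmp orders the strings by their raw bytes.
usort($words, 'strcmp');

// 'Z' is byte 0x5A, 'a' is 0x61, and UTF-8 "é" starts with 0xC3,
// so the byte order is: Zebra, apple, étude.
print_r($words);

// Only the sign of the return value is meaningful.
var_dump(strcmp('Zebra', 'apple') < 0);      // bool(true): uppercase sorts before lowercase
var_dump(strcmp("\u{00E9}tude", 'zoo') > 0); // bool(true): "é" sorts after every ASCII letter
```

A dictionary would place "étude" between "apple" and "Zebra"; byte order puts it last.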

2. The four most common causes of "garbled text / abnormal comparisons"

  1. Inconsistent encodings (UTF-8 vs GBK, etc.)
    The same Chinese text encoded once as UTF-8 and once as GBK yields completely different bytes. strcmp() only looks at bytes, so the two naturally fail to match.

  2. UTF-8 BOM and invisible characters
    A BOM (EF BB BF), a zero-width space (ZWSP), or invisible control characters that slip in at the start of a file or in user input change the leading or trailing bytes, which makes the comparison result look even more "outrageous".

  3. Normalization differences (NFC / NFD)
    é can be a single precomposed character (NFC) or "e + combining accent" (NFD). The two look identical to the eye, but the bytes differ, so strcmp() reports them as unequal or sorts them unexpectedly.

  4. Expecting "human language / collation rules" while using byte comparison
    Want to sort by Chinese pinyin, German special characters, French accents, or Japanese kana rules? strcmp() understands none of these; you need a locale-aware comparison tool.
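The first three causes can be seen directly in the bytes. The sketch below writes the byte values out explicitly so that it does not depend on the encoding of the source file; the variable names and sample strings are assumptions for illustration.

```php
<?php
// Cause 1: the same character 中 in two encodings.
$utf8 = "\xE4\xB8\xAD"; // UTF-8 bytes of 中
$gbk  = "\xD6\xD0";     // GBK bytes of 中
var_dump(strcmp($utf8, $gbk) === 0);      // bool(false)

// Cause 2: a UTF-8 BOM in front of otherwise identical text.
$plain   = 'hello';
$withBom = "\xEF\xBB\xBF" . 'hello';
var_dump(bin2hex($withBom));              // string(16) "efbbbf68656c6c6f"
var_dump(strcmp($plain, $withBom) === 0); // bool(false)

// Cause 3: é as one code point (NFC) versus "e" + combining acute accent (NFD).
$nfc = "\u{00E9}";  // bytes c3 a9
$nfd = "e\u{0301}"; // bytes 65 cc 81
var_dump(bin2hex($nfc), bin2hex($nfd));
var_dump(strcmp($nfc, $nfd) === 0);       // bool(false)
```

Dumping strings with bin2hex() is often the quickest way to find out which of these causes is in play.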

3. How to avoid it: a set of practical solutions

1) Standardize on UTF-8 (without BOM)

  • Unify every entry point: database connection, HTTP headers, template files, and the CLI environment should all be set to UTF-8, end to end.

  • Strip the BOM and invisible characters:

```php
<?php
function strip_bom_and_controls(string $s): string {
    // Strip a leading UTF-8 BOM (EF BB BF)
    $s = preg_replace('/^\xEF\xBB\xBF/', '', $s) ?? $s;
    // Strip common invisible characters: ZWSP, ZWNJ, ZWJ, NBSP ...
    // With the /u flag, preg_replace() returns null on invalid UTF-8, so fall back to the input
    $s = preg_replace('/[\x{200B}\x{200C}\x{200D}\x{00A0}]/u', '', $s) ?? $s;
    return $s;
}
```
  • Convert when necessary:

```php
<?php
// The third argument lists candidate source encodings; mbstring guesses among them heuristically
$clean = mb_convert_encoding($input, 'UTF-8', 'UTF-8,GBK,GB2312,BIG5,ISO-8859-1');
```
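Guessing the source encoding from a candidate list is heuristic, and short inputs are easy to misdetect. One cautious alternative, sketched below, is to validate first and convert only from an encoding you already know. The helper name to_utf8() and its $assumedSource parameter are hypothetical, and the sketch assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not from the article): return $s as UTF-8.
// $assumedSource is the legacy encoding you expect when $s is not valid UTF-8.
function to_utf8(string $s, string $assumedSource = 'GBK'): string {
    // Already valid UTF-8: leave the bytes untouched.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    // Otherwise convert from the single encoding we were told to expect.
    return mb_convert_encoding($s, 'UTF-8', $assumedSource);
}

$gbkBytes = "\xD6\xD0";                               // GBK bytes of 中
var_dump(bin2hex(to_utf8($gbkBytes)));                // string(6) "e4b8ad"
var_dump(to_utf8("\xE4\xB8\xAD") === "\xE4\xB8\xAD"); // bool(true): valid UTF-8 passes through
```

This trades convenience for predictability: the caller has to know where the data came from, but a string is never silently reinterpreted in the wrong encoding.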

2) Apply Unicode normalization

  • Install/enable the intl extension and use Normalizer to bring everything to NFC:

```php
<?php
if (class_exists('Normalizer')) {
    $a = Normalizer::normalize($a, Normalizer::FORM_C);
    $b = Normalizer::normalize($b, Normalizer::FORM_C);
}
```
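As a small worked example of why this step matters, the sketch below compares a precomposed and a decomposed spelling of the same word before and after normalization. The wrapper to_nfc() is a hypothetical helper; it returns its input unchanged when intl is not loaded, so the last comparison only becomes true when Normalizer is available.

```php
<?php
// Hypothetical helper: normalize to NFC when intl is available,
// otherwise return the input unchanged.
function to_nfc(string $s): string {
    if (!class_exists('Normalizer')) {
        return $s; // intl not loaded
    }
    $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
    // normalize() returns false for invalid input; keep the original in that case.
    return $normalized === false ? $s : $normalized;
}

$composed   = "caf\u{00E9}";  // é as one code point
$decomposed = "cafe\u{0301}"; // e + combining acute accent

var_dump(strcmp($composed, $decomposed) === 0);                 // bool(false): the bytes differ
var_dump(strcmp(to_nfc($composed), to_nfc($decomposed)) === 0); // true when intl is loaded
```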

3) Choose the right comparison function

  • You only need byte-level equality: use strcmp(). If you want a case-insensitive byte comparison, use strcasecmp() (still byte-wise, ASCII rules only).

  • Human-readable, language-aware comparison (sorting, deduplication, lookup): use the Collator class from intl (locale-aware comparison that understands accents, case, and variants).

```php
<?php
$coll = new \Collator('zh_CN');           // or 'zh-Hans-CN', 'en_US', 'de_DE', etc.
$coll->setStrength(\Collator::SECONDARY); // ignore case but distinguish accents, etc.
var_dump($coll->compare('中文', '字符'));   // -1/0/1, based on the locale's rules
```
  • Counting, truncating, or iterating over "characters as the user sees them": use the grapheme_* functions (also part of intl) to work with grapheme clusters, so that you do not cut an emoji in half:

```php
<?php
$text = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}开发"; // contains a ZWJ-joined emoji
echo grapheme_substr($text, 0, 2); // the whole emoji followed by 开
```
  • Case-insensitive comparison of multibyte strings: apply mb_strtolower() or mb_convert_case() before comparing (remember to normalize first):

```php
<?php
$a = Normalizer::normalize($a, Normalizer::FORM_C);
$b = Normalizer::normalize($b, Normalizer::FORM_C);
$aFold = mb_strtolower($a, 'UTF-8');
$bFold = mb_strtolower($b, 'UTF-8');
var_dump(strcmp($aFold, $bFold) === 0); // true when the strings are equal ignoring case
```

4) Practical examples: sorting and deduplication

Sorting by Chinese pinyin (schematic)

```php
<?php
$names = ['张三', '李四', '王五', '阿里', '曹操'];
$coll = new \Collator('zh_CN@collation=pinyin'); // requires ICU support on the system
$coll->sort($names);
print_r($names);
```

"Human-style deduplication" that ignores case and accents

```php
<?php
$input = ['café', 'Cafe', 'CAFé', 'cafe'];
$coll = new \Collator('fr_FR');
$coll->setStrength(\Collator::PRIMARY); // ignore accents and case
$unique = [];
foreach ($input as $s) {
    $sN = Normalizer::normalize($s, Normalizer::FORM_C);
    $found = false;
    foreach ($unique as $u) {
        if ($coll->compare($sN, $u) === 0) { $found = true; break; }
    }
    if (!$found) $unique[] = $sN;
}
print_r($unique); // only one variant is kept
```
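The nested loop above compares every new string with every string already kept, which gets slow on long lists. A possible variation, sketched here under the assumption that intl is loaded, uses Collator::getSortKey() so that each string is processed once. The function name dedupe_by_sort_key() and the sample words are hypothetical.

```php
<?php
// Hypothetical helper: keep the first string seen for each distinct collation sort key.
function dedupe_by_sort_key(array $items, \Collator $coll): array {
    $seen = [];
    foreach ($items as $s) {
        $key = $coll->getSortKey($s);
        if ($key === false) {
            continue; // skip strings ICU cannot process
        }
        $hex = bin2hex($key); // sort keys are binary; hex makes a safe array key
        if (!isset($seen[$hex])) {
            $seen[$hex] = $s;
        }
    }
    return array_values($seen);
}

if (class_exists('Collator')) {
    $coll = new \Collator('fr_FR');
    $coll->setStrength(\Collator::PRIMARY); // ignore accents and case
    // Keeps the first spelling of "café" plus "thé".
    print_r(dedupe_by_sort_key(["caf\u{00E9}", 'Cafe', "CAF\u{00E9}", 'cafe', "th\u{00E9}"], $coll));
}
```

Two strings that compare as equal under a collator produce identical sort keys, so grouping by key gives the same result as pairwise compare() calls.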

4. Quick troubleshooting checklist (rule out "garbled text" first)

  1. Confirm the encoding: mb_detect_encoding($s, ['UTF-8', 'GBK', 'BIG5', 'ISO-8859-1'], true). If in doubt, convert to UTF-8 first.

  2. Strip the BOM and zero-width characters: see strip_bom_and_controls() above.

  3. Normalize to NFC: Normalizer::normalize().

  4. Clarify the goal of the comparison:

    • Byte-level equality: strcmp / strcasecmp.

    • Language-aware sorting or equality: Collator (intl).

    • Processing at the level of user-visible characters: grapheme_*.

  5. Keep the database consistent with the HTTP headers: MySQL should use utf8mb4 with a suitable collation (such as utf8mb4_0900_ai_ci), and HTTP should send Content-Type: text/html; charset=UTF-8.
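Steps 1 to 3 of the checklist could be chained roughly as follows. This is a sketch under assumptions: prepare_for_compare() is a hypothetical name, mbstring is assumed to be loaded, invalid UTF-8 is rejected rather than guessed at, and the NFC step is skipped when intl is missing.

```php
<?php
// Hypothetical helper combining checklist steps 1-3.
// Returns a cleaned UTF-8 string, or null when the input is not valid UTF-8.
function prepare_for_compare(string $s): ?string {
    // Step 1: confirm the encoding (reject rather than guess).
    if (!mb_check_encoding($s, 'UTF-8')) {
        return null;
    }
    // Step 2: strip a leading BOM and zero-width characters (ZWSP, ZWNJ, ZWJ).
    if (strncmp($s, "\xEF\xBB\xBF", 3) === 0) {
        $s = substr($s, 3);
    }
    $s = preg_replace('/[\x{200B}\x{200C}\x{200D}]/u', '', $s) ?? $s;
    // Step 3: normalize to NFC when intl is available.
    if (class_exists('Normalizer')) {
        $n = Normalizer::normalize($s, Normalizer::FORM_C);
        if ($n !== false) {
            $s = $n;
        }
    }
    return $s;
}

$a = "\xEF\xBB\xBF" . "na\u{200B}me"; // BOM + zero-width space inside the word
$b = 'name';
var_dump(prepare_for_compare($a) === prepare_for_compare($b)); // bool(true)
var_dump(prepare_for_compare("\xD6\xD0"));                     // NULL: not valid UTF-8
```

Step 4 remains a decision for the caller: once both strings are prepared, pick strcmp(), a Collator, or the grapheme_* functions depending on what "equal" should mean.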

5. FAQ: some typical pitfalls

  • "Both strings are UTF-8, so why does strcmp() say they differ?"
    One of them may contain a BOM or zero-width characters, or one is NFC and the other NFD. Clean and normalize first.

  • "Can strcasecmp() do internationalized case-insensitive comparison?"
    Its case folding essentially follows ASCII semantics. A more reliable approach: compare after mb_strtolower(), or use a Collator with an appropriate strength.

  • "Emoji or combining sequences get truncated or miscounted"
    Use grapheme_strlen / grapheme_substr, and do not use strlen / substr to process user-visible characters.
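The last two answers can be checked with a short sketch. The sample strings are assumptions for illustration; mbstring is assumed to be loaded, and the grapheme count is only printed when intl is available.

```php
<?php
$upper = "\u{00C9}COLE"; // ÉCOLE
$lower = "\u{00E9}cole"; // école

// strcasecmp() folds only ASCII letters, so É and é still differ.
var_dump(strcasecmp($upper, $lower) === 0); // bool(false)

// mb_strtolower() applies Unicode case mapping before the byte comparison.
var_dump(strcmp(mb_strtolower($upper, 'UTF-8'), mb_strtolower($lower, 'UTF-8')) === 0); // bool(true)

// Counting: "e" + combining acute accent is one user-visible character.
$text = "e\u{0301}";
var_dump(strlen($text));              // int(3): bytes
var_dump(mb_strlen($text, 'UTF-8'));  // int(2): code points
if (function_exists('grapheme_strlen')) {
    var_dump(grapheme_strlen($text)); // int(1): grapheme clusters
}
```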