When dealing with multibyte strings, PHP provides the mbstring extension so that character encodings are handled correctly. Regular expressions are affected by encoding settings too, especially when the text uses a multibyte encoding such as UTF-8, Shift-JIS, or EUC-JP. PHP provides the mb_regex_encoding() and mb_get_info() functions to help developers manage and inspect these settings.
The mb_regex_encoding() function can be used to set or get the encoding currently used for multibyte regular expressions.
Syntax:
mb_regex_encoding(?string $encoding = null): string|bool
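If no argument is passed, it returns the name of the current regex encoding as a string;
If an encoding is passed in, it sets the regex encoding and returns true on success. It does not return the previous encoding, so read that first if you need to restore it later. An invalid encoding name returns false on PHP 7 and throws a ValueError on PHP 8.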
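The two call forms can be sketched as follows. This is a minimal illustration, assuming the mbstring extension is loaded with multibyte regex support; the variable names are only for the example.

```php
<?php
// No argument: read the current regex encoding as a string.
$before = mb_regex_encoding();

// With an argument: switch the encoding. The return value is true on
// success, not the name of the encoding that was active before.
$ok = mb_regex_encoding('EUC-JP');
$during = mb_regex_encoding();

// Restore the original encoding once the work is done.
mb_regex_encoding($before);

var_dump($ok);
echo "While switched: $during\n";
echo "Restored to: ", mb_regex_encoding(), "\n";
```

Saving the old value before switching is what makes the restore possible, since the setter itself only reports success.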
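mb_get_info() returns configuration details of the current mbstring environment, including the language, internal encoding, HTTP input/output encodings, and encoding detection order.
Syntax:
mb_get_info(string $type = "all"): array|string|int|false
With the default "all", it returns an associative array of every setting; passing a key such as 'internal_encoding' returns just that value. Note that the keys documented in the PHP manual do not include a regex encoding entry, and an unrecognized $type returns false on PHP 7 or throws a ValueError on PHP 8. To read the regex encoding itself, call mb_regex_encoding() with no arguments.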
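A short sketch of inspecting the mbstring configuration follows. It assumes the keys 'internal_encoding', 'language' and 'detect_order' are present in the returned array, as listed in the PHP manual, and it reads the regex encoding through mb_regex_encoding() rather than through mb_get_info().

```php
<?php
// Full snapshot of the mbstring configuration as an associative array.
$info = mb_get_info();

foreach (['internal_encoding', 'language', 'detect_order'] as $key) {
    $value = $info[$key] ?? '(not reported)';
    // detect_order is itself an array, so flatten it for display.
    echo $key, ': ', is_array($value) ? implode(', ', $value) : $value, "\n";
}

// A single setting can be requested by name.
$internal = mb_get_info('internal_encoding');
echo "internal encoding: $internal\n";

// The regex encoding is read through its own function.
$regex = mb_regex_encoding();
echo "regex encoding: $regex\n";
```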
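Here is a complete example that sets the regex encoding with mb_regex_encoding(), verifies that the setting took effect, and inspects the related mbstring configuration with mb_get_info():
<?php
// Remember the encoding that was active before changing it
$previousEncoding = mb_regex_encoding();
echo "The previous regex encoding was: $previousEncoding\n";
// Set multibyte regular expressions to use UTF-8 (returns true on success)
mb_regex_encoding('UTF-8');
// Verify that the current regex encoding is UTF-8
$currentRegexEncoding = mb_regex_encoding();
echo "The current regex encoding is: $currentRegexEncoding\n";
// Inspect a related mbstring setting through mb_get_info()
$internalEncoding = mb_get_info('internal_encoding');
echo "The internal encoding is: $internalEncoding\n";
// Sample match: hiragana characters plus the prolonged sound mark "ー"
$pattern = '\A[\p{Hiragana}ー]+\z';
$subject = 'こんにちは';
if (mb_ereg($pattern, $subject)) {
    echo "Match succeeded: $subject is hiragana text\n";
} else {
    echo "Match failed: $subject is not purely hiragana text\n";
}
?>
In the above example:
We first save the current regex encoding and then set it to UTF-8;
Then we call mb_regex_encoding() with no arguments to confirm the change, and use mb_get_info() to check the internal encoding alongside it;
Finally, we use mb_ereg() to match against a Unicode script property.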
Make sure your PHP installation has the mbstring extension enabled with multibyte regex support (the mb_ereg family of functions), since Unicode properties such as \p{Hiragana} depend on it.
When dealing with content in multiple languages, especially text containing Chinese characters, kana, or other non-ASCII characters, using the appropriate encoding avoids garbled or inaccurate results in regex matches. If the regex encoding does not match the actual encoding of the text, matches can easily fail or even raise errors.
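One way to keep the regex encoding aligned with the text being matched is to set it just for the duration of a match and restore it afterwards. The helper below, matchWithEncoding(), is a hypothetical name introduced for this sketch and is not part of mbstring; it also assumes the source file is saved as UTF-8.

```php
<?php
/**
 * Hypothetical helper: run mb_ereg() under a given regex encoding,
 * then restore whatever encoding was active before the call.
 */
function matchWithEncoding(string $pattern, string $subject, string $encoding): bool
{
    $previous = mb_regex_encoding();
    mb_regex_encoding($encoding);
    try {
        // Cast to bool because older PHP versions return an int on success.
        return (bool) mb_ereg($pattern, $subject);
    } finally {
        // Runs even if mb_ereg() throws, so the global setting never leaks.
        mb_regex_encoding($previous);
    }
}

$pattern = '\A[\p{Hiragana}ー]+\z';
$isKana = matchWithEncoding($pattern, 'こんにちは', 'UTF-8');
$isLatinKana = matchWithEncoding($pattern, 'Hello', 'UTF-8');

var_dump($isKana, $isLatinKana);
```

Because the regex encoding is a global setting, restoring it in a finally block keeps one match from silently changing the behavior of unrelated code later in the request.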
Q: If I do not explicitly set the regex encoding, what is the default? A: On PHP 5.6 and later the default is UTF-8 unless the configuration overrides it (older versions defaulted to EUC-JP). Because the effective value can still vary with the environment, it is recommended to set it explicitly.
Q: How can I check whether PHP has mbstring available? A: Run phpinfo() and look for the mbstring section, or call extension_loaded('mbstring') from code.
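A runtime check along these lines can guard code that relies on multibyte regular expressions. It is a minimal sketch: the messages are illustrative, and the function_exists() test is included because the mb_ereg family can be excluded at build time (for example with --disable-mbregex) even when mbstring itself is loaded.

```php
<?php
// True when the mbstring extension is loaded in this PHP build.
$hasMbstring = extension_loaded('mbstring');

// The multibyte regex functions can be compiled out separately, so test for them too.
$hasMbRegex = function_exists('mb_regex_encoding') && function_exists('mb_ereg');

if (!$hasMbstring) {
    echo "mbstring is not loaded; enable or install the extension first.\n";
} elseif (!$hasMbRegex) {
    echo "mbstring is loaded, but multibyte regex support is unavailable.\n";
} else {
    echo "mbstring and multibyte regex support are available.\n";
}
```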