mb_substitute_character() is a PHP multibyte string (mbstring) function that sets or gets the substitute character used when a conversion encounters an invalid character. "Invalid" here means either a byte sequence that is malformed in the source encoding or a character that cannot be represented in the target encoding, which is common when working with text from different language character sets.
```php
mb_substitute_character(string|int|null $substitute_character = null): string|int|bool
```
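$substitute_character: The substitute character, given as an integer Unicode code point (for example 0x3F for a question mark), or one of the strings "none" (drop the invalid character), "long" (output its code value) or "entity" (output a character entity). When omitted or null, the function returns the current setting instead of changing it.
Return value: When setting, returns true on success and false on failure. When getting, returns the current substitute character as an integer code point, or the string "none", "long" or "entity". In PHP 8, an unsupported value such as the literal string '?' raises a ValueError.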
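The two calling modes described above can be sketched as follows. This is a minimal illustration, assuming PHP 8 with the mbstring extension loaded; the variable names are only for the example.

```php
<?php
// Getter: with no argument, the current setting is returned
// (an int code point, or the string "none", "long" or "entity").
$previous = mb_substitute_character();

// Setter: an integer code point is accepted and true is returned on success.
$ok = mb_substitute_character(0xFFFD);
var_dump($ok);                       // bool(true)
var_dump(mb_substitute_character()); // int(65533)

// Setter: the keyword modes are passed as strings.
mb_substitute_character("long");
var_dump(mb_substitute_character()); // string(4) "long"

// Put the original setting back so later code is unaffected.
mb_substitute_character($previous);
```

Saving and restoring the previous value matters because the setting applies to every later mbstring conversion in the same request.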
In a multilingual environment, encoding mismatches frequently occur during data input and output. For example, a string may contain characters that the receiving system's character set does not support, or incoming data may contain malformed byte sequences. Left unhandled, such characters can garble the output or be silently lost. mb_substitute_character() lets you decide what is emitted in their place, so invalid characters are handled in a predictable way.
You can get the current substitute character by calling mb_substitute_character() without passing any arguments.
```php
$current_substitute = mb_substitute_character();
echo "Current substitute character: " . $current_substitute;
```
The getter returns the substitute character as an integer Unicode code point. By default this is 63 (0x3F), the question mark. The Unicode "replacement character" U+FFFD is not the default, but you can select it explicitly with mb_substitute_character(0xFFFD).
To set a new substitute character, pass its Unicode code point as an integer to mb_substitute_character(). Passing the character itself as a string does not work: the only strings accepted are "none", "long" and "entity". For example, to set a question mark (?), pass 0x3F:
```php
mb_substitute_character(0x3F); // 0x3F is the code point of "?"
```
At this point, all invalid characters will be replaced with question marks during the conversion process.
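As a small illustration of that replacement, the sketch below cleans a string containing a byte that is never valid in UTF-8. It assumes the mbstring extension is loaded; `$dirty` and `$clean` are names invented for the example, and the substitute is set with the integer code point 0x3F.

```php
<?php
$previous = mb_substitute_character();
mb_substitute_character(0x3F); // "?"

// "\xFF" can never appear in well-formed UTF-8.
$dirty = "abc\xFFdef";

// Converting UTF-8 to UTF-8 re-validates the string and swaps each
// invalid sequence for the substitute character.
$clean = mb_convert_encoding($dirty, 'UTF-8', 'UTF-8');

var_dump(mb_check_encoding($dirty, 'UTF-8')); // bool(false)
echo $clean, PHP_EOL;                         // abc?def

// Restore the earlier setting.
mb_substitute_character($previous);
```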
The most common use case for mb_substitute_character() is encoding conversion with mb_convert_encoding(). Suppose you want to convert a string to an encoding that cannot represent all of its characters. mb_convert_encoding() does not stop at such characters; it emits the substitute character in their place, so setting it beforehand controls exactly what appears in the output.
```php
// Set the substitute character to a question mark (code point 0x3F)
mb_substitute_character(0x3F);

// UTF-8 input containing characters that ISO-8859-1 cannot represent
$input_string = "Price: 100 日本円";

// Convert the encoding; unrepresentable characters are replaced
$converted_str = mb_convert_encoding($input_string, 'ISO-8859-1', 'UTF-8');
echo $converted_str;
```
In this example, every character of $input_string that cannot be represented in ISO-8859-1 is converted to a question mark. Note the direction of the conversion: going from ISO-8859-1 to UTF-8 would never trigger a substitution, because every ISO-8859-1 byte maps to a valid Unicode character.
Besides the default question mark (?) or the Unicode replacement character (0xFFFD), you can set the substitute to any valid Unicode code point. For example, use an asterisk (*), whose code point is 0x2A:
```php
mb_substitute_character(0x2A); // 0x2A is the code point of "*"
```
This can help you more clearly mark the position of invalid characters in some cases.
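Apart from a single character, the function also accepts three keyword modes as strings: "none", "long" and "entity". The sketch below runs the same conversion under each mode so the results can be compared. It assumes the mbstring extension is loaded; `$input` and `$results` are illustrative names, and the exact text produced by "long" and "entity" is left for you to inspect rather than stated here.

```php
<?php
$previous = mb_substitute_character();

// U+65E5 has no ASCII equivalent, so it triggers a substitution.
$input = "A\u{65E5}B";
$results = [];

foreach (["none", "long", "entity"] as $mode) {
    mb_substitute_character($mode);
    $results[$mode] = mb_convert_encoding($input, 'ASCII', 'UTF-8');
}

// Restore the earlier setting.
mb_substitute_character($previous);

// "none" drops the character; "long" and "entity" keep its code value
// in a textual form, which helps when tracing where bad data came from.
print_r($results);
```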
You should choose an appropriate substitute character based on your actual needs. If the substitute character is visible to users, pick one that stands out in the surrounding text, such as ? or *.
When handling encodings, ensure the target encoding supports the substitute character you choose. If the chosen character cannot be represented in the target encoding, it may still be replaced by the default substitute character.
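One way to check this in advance is to test whether the candidate character survives a conversion to the target encoding. The helper below is hypothetical (it is not part of mbstring) and is only a sketch: it temporarily switches to the "none" mode, in which unconvertible characters are dropped, and treats an empty result as "not representable".

```php
<?php
/**
 * Hypothetical helper (not part of mbstring): reports whether a code point
 * can be represented in $encoding.
 */
function can_encode_code_point(int $code_point, string $encoding): bool
{
    $previous = mb_substitute_character();

    // With "none", unconvertible characters are dropped, so an empty
    // result means the character has no representation in $encoding.
    mb_substitute_character("none");
    $converted = mb_convert_encoding(mb_chr($code_point, 'UTF-8'), $encoding, 'UTF-8');

    // Restore the caller's setting before returning.
    mb_substitute_character($previous);

    return $converted !== '';
}

var_dump(can_encode_code_point(0x2A, 'ISO-8859-1'));   // bool(true): "*" is plain ASCII
var_dump(can_encode_code_point(0xFFFD, 'ISO-8859-1')); // bool(false): U+FFFD is outside Latin-1
```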
When processing data in bulk, especially when getting data from external or untrusted sources, setting a proper substitute character can effectively prevent data corruption or program errors.
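For bulk processing, a pattern along these lines may help. It is a sketch under stated assumptions: `convert_batch()` is a hypothetical helper written for this example, PHP 7.4 or later is assumed for the arrow function, and the encodings passed in are assumed to be valid mbstring encoding names.

```php
<?php
/**
 * Hypothetical helper: converts a batch of strings while temporarily
 * using $substitute, then restores the previous setting.
 *
 * @param string[] $strings
 * @return string[]
 */
function convert_batch(array $strings, string $to, string $from, int $substitute = 0x3F): array
{
    $previous = mb_substitute_character();
    mb_substitute_character($substitute);

    try {
        return array_map(
            fn(string $s) => mb_convert_encoding($s, $to, $from),
            $strings
        );
    } finally {
        // The setting affects all later conversions, so always put it back.
        mb_substitute_character($previous);
    }
}

// Example input: two of the three rows contain non-ASCII characters.
$rows = ["caf\u{E9}", "plain ascii", "\u{65E5}\u{672C}"];
$converted = convert_batch($rows, 'ASCII', 'UTF-8', 0x2A);
print_r($converted);
```

Wrapping the change in try/finally keeps the substitute character from leaking into unrelated code, even if a conversion throws.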
mb_substitute_character() provides flexible control when handling string encoding conversions. When encountering invalid characters, it allows you to replace them with a clear character, avoiding program crashes or incorrect output. Mastering this function not only enhances your ability to handle multilingual text but also improves the robustness of your programs. By setting substitute characters reasonably, developers can effectively avoid problems caused by encoding inconsistencies.