How to Use mb_substitute_character to Prevent String Truncation and Garbled Text

gitbox 2025-09-03

When working with multibyte encoded strings (such as UTF-8 or GBK), garbled text and truncated characters are common problems, especially when strings are sliced or converted between encodings. PHP's mbstring extension offers several functions for these cases, and mb_substitute_character is one of the most useful. This article explains how to use mb_substitute_character, together with the other mbstring functions, to prevent string truncation and garbled text.

1. Introduction to the mb_substitute_character Function

mb_substitute_character is a function in PHP's Multibyte String extension (mbstring). It sets or reads the substitute character that mbstring emits when a conversion encounters a byte sequence that is invalid in the input encoding, or a character that does not exist in the output encoding. With a substitute configured, malformed input produces a predictable placeholder instead of garbled output.

Function Prototype:

    mb_substitute_character(string|int|null $substitute_character = null): string|int|bool

  • $substitute_character: The substitute character, given either as an integer Unicode code point (for example 0x3F for “?”) or as one of the strings "none" (drop the offending character), "long" (output its code value, such as U+3000), or "entity" (output a numeric character reference, such as &#x3000;). When the argument is omitted or null, the function returns the current setting instead of changing it. The default substitute is a question mark (“?”).
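As a minimal sketch of reading and changing the setting, the snippet below saves the current substitute, switches it twice, and restores it. The variable names are illustrative only and are not part of the mbstring API.

```php
<?php
// Remember the current setting so it can be restored afterwards.
$previous = mb_substitute_character();

// Use U+FFFD (the Unicode replacement character) as the substitute.
mb_substitute_character(0xFFFD);
$asCodePoint = mb_substitute_character();   // read back as an integer code point

// Drop offending characters entirely.
mb_substitute_character("none");
$asKeyword = mb_substitute_character();     // read back as the string "none"

// Restore whatever was configured before.
mb_substitute_character($previous);

var_dump($asCodePoint, $asKeyword);
```

Saving and restoring the previous value keeps the change local, which matters because the setting is global to the running script.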

2. Why Do String Truncation and Garbled Text Occur?

Multibyte characters (like Chinese characters) do not all occupy the same number of bytes; in UTF-8 a character takes between one and four bytes. Byte-oriented functions such as substr and strpos ignore character boundaries, so a slice can end in the middle of a character and leave an incomplete byte sequence that displays as garbled text. The same thing happens when mb_substr or mb_strpos is called with an encoding that does not match the string's actual encoding.
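The difference between counting bytes and counting characters can be sketched as follows. The string is written with \u{...} escapes so the example does not depend on how the source file is saved; the variable names are illustrative.

```php
<?php
// "你好世界": four characters, three bytes each in UTF-8.
$text = "\u{4F60}\u{597D}\u{4E16}\u{754C}";

$byteLength = strlen($text);               // counts bytes
$charLength = mb_strlen($text, 'UTF-8');   // counts characters

// substr() cuts after 4 bytes: one full character plus one stray byte.
$byteSlice = substr($text, 0, 4);

// mb_substr() cuts after 2 characters, on a character boundary.
$charSlice = mb_substr($text, 0, 2, 'UTF-8');

var_dump($byteLength, $charLength);
var_dump(mb_check_encoding($byteSlice, 'UTF-8'));  // the byte slice is not valid UTF-8
var_dump(mb_check_encoding($charSlice, 'UTF-8'));  // the character slice is valid
```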

3. How to Use mb_substitute_character to Prevent Garbled Text and Truncation

To keep output readable when a string contains invalid byte sequences or characters that the target encoding cannot represent, use mb_substitute_character to set a substitute character. Whenever mbstring meets such a character during conversion, it emits the substitute in its place. Combined with character-aware functions such as mb_substr, this gives predictable, user-friendly output instead of garbled text.

Example 1: Set the Substitute Character as a Question Mark (“?”)

    <?php
    // Set the substitute character to a question mark "?" (code point 0x3F).
    // The function expects a code point, not a one-character string.
    mb_substitute_character(0x3F);

    // Example string (the source file is assumed to be saved as UTF-8)
    $string = "Hello, 你好,世界!";

    // Slice the first 10 characters; the encoding argument must match
    // the string's actual encoding
    echo mb_substr($string, 0, 10, 'UTF-8');

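With a question mark configured as the substitute, any bytes that a conversion function such as mb_convert_encoding or mb_scrub cannot process are replaced by “?” instead of appearing as garbled text. mb_substr itself avoids splitting characters as long as its encoding argument matches the string's actual encoding.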

Example 2: Use the Integer Code of a Substitute Character

You can also pass any other Unicode code point as the substitute for finer control. A common choice is U+FFFD, the Unicode replacement character “�”, which makes replaced characters easy to tell apart from genuine question marks.

    <?php
    // Set the substitute character to the Unicode replacement character (U+FFFD)
    mb_substitute_character(0xFFFD);

    // Example string
    $string = "Hello, 你好,world!";

    // Slice using UTF-8 encoding
    echo mb_substr($string, 0, 10, 'UTF-8');

Here, invalid or unconvertible characters met during conversion are replaced with U+FFFD, so problematic input stays visible in the output without breaking the rest of the string.
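The substitution only becomes visible when the input actually contains invalid bytes. The following sketch assumes a hypothetical input with a stray 0xFF byte (never valid in UTF-8) and uses mb_scrub, available since PHP 7.2, to replace it according to the current setting.

```php
<?php
// Hypothetical input: ASCII text with a stray 0xFF byte in the middle.
$broken = "Hello \xFF world";

$previous = mb_substitute_character();

// Substitute with a question mark (code point 0x3F).
mb_substitute_character(0x3F);
$withQuestionMark = mb_scrub($broken, 'UTF-8');

// Substitute with U+FFFD, the Unicode replacement character.
mb_substitute_character(0xFFFD);
$withReplacementChar = mb_scrub($broken, 'UTF-8');

// Drop the invalid byte entirely.
mb_substitute_character("none");
$withNothing = mb_scrub($broken, 'UTF-8');

// Restore the earlier setting.
mb_substitute_character($previous);

var_dump($withQuestionMark, $withReplacementChar, $withNothing);
```

After scrubbing, each result is valid UTF-8 and can be passed safely to mb_substr or written to output.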

4. Common Use Cases

4.1 When Slicing Multibyte Strings

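When slicing multibyte strings, use mb_substr so that the slice always ends on a character boundary, and set a substitute character with mb_substitute_character so that any invalid bytes already present in the input are handled predictably when the string is converted.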

    <?php
    // Set the substitute character to "?" (code point 0x3F)
    mb_substitute_character(0x3F);

    // Example string containing multibyte characters (8 characters)
    $string = "这是一段测试文本";

    // Slice the first six characters
    $sub_string = mb_substr($string, 0, 6, 'UTF-8');

    echo $sub_string;  // Outputs “这是一段测试”

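The byte-oriented substr function counts bytes, so it can cut a multibyte character in half and produce garbled text. mb_substr counts characters in the given encoding, which avoids the problem; the substitute character then covers any invalid bytes that were already present when the string is later converted or scrubbed.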
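A common application is shortening titles for display. The helper below is a hypothetical sketch (truncate_chars is not an mbstring function) that truncates by characters and appends an ellipsis only when something was cut off. It assumes the input is valid UTF-8 and that the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper: keep at most $maxChars characters, then add a suffix.
function truncate_chars(string $text, int $maxChars, string $suffix = "\u{2026}"): string
{
    // Short enough already: return unchanged, without a suffix.
    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        return $text;
    }
    // mb_substr counts characters, so the cut never splits a character.
    return mb_substr($text, 0, $maxChars, 'UTF-8') . $suffix;
}

$title = "这是一段测试文本";

$short = truncate_chars($title, 6);    // cut after six characters, suffix added
$whole = truncate_chars($title, 20);   // shorter than the limit, returned as-is

echo $short, "\n", $whole, "\n";
```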

4.2 During Encoding Conversion

When converting character encodings, incompatible characters may appear. mb_substitute_character can handle these cases to ensure the converted string does not become garbled.

    <?php
    // Set the substitute character to "?" (code point 0x3F)
    mb_substitute_character(0x3F);

    // Convert a UTF-8 string to GBK
    $string = "这是一段UTF-8编码的字符串";
    $converted_string = mb_convert_encoding($string, 'GBK', 'UTF-8');

    // The output is now GBK-encoded bytes
    echo $converted_string;

By setting a substitute character, any character that has no equivalent in the target encoding is replaced individually, and the rest of the string is converted as normal.
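To see the substitution during conversion, the input needs a character the target encoding lacks. The sketch below assumes an emoji (U+1F600), which GBK cannot represent, and compares the default question mark with the "entity" mode.

```php
<?php
// "OK " followed by U+1F600, an emoji with no GBK equivalent.
$text = "OK \u{1F600}";

$previous = mb_substitute_character();

// Replace unconvertible characters with "?".
mb_substitute_character(0x3F);
$plain = mb_convert_encoding($text, 'GBK', 'UTF-8');

// Replace unconvertible characters with a numeric character reference.
mb_substitute_character("entity");
$entity = mb_convert_encoding($text, 'GBK', 'UTF-8');

// Restore the earlier setting.
mb_substitute_character($previous);

// Both results are pure ASCII here, so they print the same in any encoding.
echo $plain, "\n", $entity, "\n";
```

The "entity" mode is convenient when the converted text ends up in HTML, because the browser can still render the original character from the reference.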

5. Conclusion

mb_substitute_character is a highly practical function for keeping garbled text out of your output. It controls what mbstring emits when a conversion meets invalid bytes or unconvertible characters, while character-aware functions such as mb_substr prevent truncation in the middle of a character. Used together when slicing strings and converting encodings, they make multibyte string handling more robust and ensure that problematic characters yield a predictable, user-friendly substitute.