How to Preserve Original Encoding After Cleaning a String with mb_scrub?

gitbox 2025-08-13

When dealing with multibyte strings, mb_scrub is a practical function for cleaning strings that contain invalid byte sequences, so that later processing does not fail on malformed input. However, many developers run into a problem after using it: text comes back garbled, particularly when the application depends on encodings other than UTF-8, such as Shift_JIS or ISO-8859-1.

So, how can you preserve the original encoding after cleaning a string with mb_scrub?

Problem Analysis

First, let’s look at the basic usage of mb_scrub:

```php
$clean = mb_scrub($dirty_string);
```

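If no encoding is specified, mb_scrub uses the internal character encoding (the value of mb_internal_encoding(), usually UTF-8). mb_scrub does not convert the string to another encoding: it interprets the bytes as the given encoding and replaces every ill-formed sequence with the substitute character, which is ? by default and can be changed with mb_substitute_character (for example to U+FFFD). The returned string is therefore valid in the encoding you passed, or in the internal encoding if you passed none, and that is not necessarily the encoding the bytes were originally written in.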
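
The replacement behaviour can be checked with a small sketch. The sample string and variable names are illustrative only; the sketch sets the substitute character explicitly so it does not rely on the local php.ini.

```php
<?php
// One invalid byte (0xFF never occurs in UTF-8) between two ASCII words.
$dirty = "abc\xFFdef";

$previous = mb_substitute_character();   // remember the current setting

mb_substitute_character(0x3F);           // '?', the usual default
$withQuestionMark = mb_scrub($dirty, 'UTF-8');

mb_substitute_character(0xFFFD);         // opt in to U+FFFD explicitly
$withReplacementChar = mb_scrub($dirty, 'UTF-8');

mb_substitute_character($previous);      // restore the earlier setting

var_dump($withQuestionMark);             // "abc?def"
var_dump(bin2hex($withReplacementChar)); // the invalid byte became EF BF BD
```

Because the substitute character is a global mbstring setting, restoring the previous value keeps the change from leaking into unrelated code.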

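Therefore, if your original string is encoded in Shift_JIS but you clean it using the default mb_scrub($str), its bytes are validated as UTF-8. Most Shift_JIS multibyte sequences are ill-formed in UTF-8, so they are replaced with the substitute character. The result is valid UTF-8, but the original text is lost, which leads to garbled output or system incompatibility.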
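
A minimal sketch of that failure, assuming the internal encoding is UTF-8 (passed explicitly here so the result does not depend on configuration). The sample bytes are the Shift_JIS encoding of "日本語".

```php
<?php
// "日本語" in Shift_JIS, written as raw bytes so the encoding of this
// source file does not matter.
$sjis = "\x93\xFA\x96\x7B\x8C\xEA";

// Scrubbed as if it were UTF-8: sequences that are ill-formed in UTF-8
// are replaced with the substitute character.
$wrong = mb_scrub($sjis, 'UTF-8');

// Scrubbed with its real encoding: the string is already valid,
// so nothing is replaced.
$right = mb_scrub($sjis, 'SJIS');

var_dump($wrong === $sjis);  // bool(false): the text was damaged
var_dump($right === $sjis);  // bool(true): the bytes are untouched
```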

Solution: Explicitly Specify the Original Encoding

To fix this, first determine the encoding of the original string, either from metadata you already have or by detection, then pass that encoding explicitly when calling mb_scrub. For example:

```php
// mb_detect_encoding() returns false when no candidate matches; check for that before scrubbing.
$original_encoding = mb_detect_encoding($dirty_string, mb_list_encodings(), true);
$clean_string = mb_scrub($dirty_string, $original_encoding);
```

This way, mb_scrub knows which encoding to use to interpret the string, and the returned value will use the same encoding.

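Note: The accuracy of mb_detect_encoding depends on the string content and the list of encodings provided. Passing mb_list_encodings() makes every supported encoding a candidate, and a single-byte encoding such as ISO-8859-1 accepts any byte sequence, so a wrong match is likely. In strict mode, a string that contains invalid bytes can also be rejected by its true encoding. Narrow the encoding list based on context whenever possible, and prefer a declared encoding (for example an HTTP header or a database column charset) over detection when one is available.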
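
One way to narrow the list is to detect against a short, context-specific set of candidates and to return nothing rather than guess. The helper name detect_from_candidates and the candidate lists below are hypothetical, chosen only for illustration.

```php
<?php
/**
 * Hypothetical helper: strict detection against a short candidate list.
 * Returns null instead of guessing when no candidate matches.
 */
function detect_from_candidates(string $input, array $candidates): ?string
{
    $encoding = mb_detect_encoding($input, $candidates, true);
    return $encoding === false ? null : $encoding;
}

// "日本語" as UTF-8 bytes and as Shift_JIS bytes.
$utf8 = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
$sjis = "\x93\xFA\x96\x7B\x8C\xEA";

// Assumed context: a feed known to contain only Japanese text.
$japaneseFeed = ['UTF-8', 'SJIS', 'EUC-JP'];

var_dump(detect_from_candidates($utf8, $japaneseFeed));
var_dump(detect_from_candidates($sjis, $japaneseFeed));
var_dump(detect_from_candidates("\xFF\xFF", ['UTF-8'])); // NULL: no candidate fits
```

Returning null forces the caller to decide what to do with undetectable input instead of silently scrubbing it in the wrong encoding.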

A More Robust Example

```php
function clean_preserve_encoding(string $input): string {
    // Strict mode: a candidate is accepted only if the whole string is valid in it.
    $encoding = mb_detect_encoding($input, ['SJIS', 'UTF-8', 'ISO-8859-1', 'EUC-JP'], true);
    if ($encoding === false) {
        // Unable to detect encoding, default to UTF-8 or throw an exception
        $encoding = 'UTF-8';
    }
    // Scrub in the detected encoding so the result stays in that encoding.
    return mb_scrub($input, $encoding);
}
```

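This function scrubs the input in its detected encoding, so valid text is returned unchanged and only ill-formed sequences are replaced. It is only as reliable as the detection step: if detection picks the wrong encoding, the scrub runs against the wrong rules.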
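
When the caller already knows the encoding, detection can be skipped entirely. The following is a sketch of such a variant, not a drop-in replacement: the function name scrub_in_encoding, its parameters, and the default candidate list are assumptions made for illustration, and it throws instead of falling back to UTF-8.

```php
<?php
/**
 * Hypothetical variant: scrub with a known encoding when the caller has one,
 * fall back to strict detection otherwise, and throw rather than guess.
 */
function scrub_in_encoding(
    string $input,
    ?string $knownEncoding = null,
    array $candidates = ['UTF-8', 'SJIS', 'EUC-JP']
): string {
    $encoding = $knownEncoding ?? mb_detect_encoding($input, $candidates, true);
    if ($encoding === false) {
        throw new UnexpectedValueException('Could not determine the input encoding');
    }
    return mb_scrub($input, $encoding);
}

// Usage: the caller knows the data came from a Shift_JIS source.
$clean = scrub_in_encoding("\x93\xFA\x96\x7B\x8C\xEA", 'SJIS');
```

Throwing keeps undetectable input from being scrubbed under a guessed encoding. Whether that is preferable to a UTF-8 fallback depends on how the application treats rejected input.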

Additional Recommendations

Always log the original encoding: If your system needs to support multiple encodings, it’s a good practice to record the encoding of each text segment in the data flow.
Prefer UTF-8 whenever possible: If you can control the input and output environments, it’s recommended to unify everything under UTF-8 encoding to avoid complexity caused by mixed encodings.
Test extreme cases: Especially when dealing with external data, test scenarios involving mixed invalid bytes, incorrect BOMs, and encoding inconsistencies.
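
The last recommendation can be turned into a small table-driven check. This is a sketch with made-up sample inputs. It verifies two properties: the scrubbed string is valid in the stated encoding, and scrubbing it a second time changes nothing.

```php
<?php
// Hypothetical edge cases: label => [raw bytes, encoding the bytes claim to be in].
$cases = [
    'stray invalid byte'       => ["abc\xFFdef", 'UTF-8'],
    'truncated UTF-8 sequence' => ["caf\xC3", 'UTF-8'],
    'UTF-16 BOM on UTF-8 text' => ["\xFF\xFEhello", 'UTF-8'],
    'UTF-8 BOM (valid bytes)'  => ["\xEF\xBB\xBFhello", 'UTF-8'],
    'truncated Shift_JIS pair' => ["\x93\xFA\x96", 'SJIS'],
];

$failures = [];
foreach ($cases as $label => [$bytes, $encoding]) {
    $scrubbed = mb_scrub($bytes, $encoding);

    // The result must be well-formed in the encoding it was scrubbed in.
    if (!mb_check_encoding($scrubbed, $encoding)) {
        $failures[] = "$label: result is still invalid";
    }
    // Scrubbing an already clean string should be a no-op.
    if (mb_scrub($scrubbed, $encoding) !== $scrubbed) {
        $failures[] = "$label: second scrub changed the string";
    }
}

echo $failures === [] ? "all cases passed\n" : implode("\n", $failures) . "\n";
```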

Summary

Using mb_scrub to clean invalid strings is an important method for safely handling multibyte strings, but if the encoding is not specified explicitly, the string is validated against the internal encoding, and text written in any other encoding can be damaged. To avoid this, always pass the original encoding when calling mb_scrub, so that the cleaned string stays in its original encoding.

This approach not only maintains data consistency but also reduces side effects caused by encoding conversions, making it an essential best practice when developing multilingual and multi-encoding compatible applications.