How to use mb_strcut to process strings containing emoji characters

gitbox 2025-05-26

mb_strcut is a practical PHP function for working with multibyte strings: it extracts a substring by byte offsets rather than character offsets. When a string contains special characters such as emoji, however, using mb_strcut requires extra care.

How mb_strcut works

mb_strcut(string $string, int $start, ?int $length = null, ?string $encoding = null): string
This function extracts a substring based on byte offsets (rather than character offsets).

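Unlike mb_substr, which counts characters, mb_strcut counts bytes, but it is a "byte-safe" version of substr: if a cut position falls between the bytes of a multibyte character, the cut is moved back to the first byte of that character, so valid input never yields half a character. The catch is that a character that does not fit entirely within the byte range is dropped, and mb_strcut knows nothing about emoji that are made of several code points.

Let’s take a look at an example:

<code>
$str = "Hello 😊 World!";
// "Hello " takes 6 bytes and 😊 takes 4, so a 9-byte cut ends inside the emoji
$cut = mb_strcut($str, 0, 9, 'UTF-8');
echo $cut;
</code>

You may expect the output to be Hello 😊, but you actually get only Hello followed by a space. This is because 😊 is a 4-byte character in UTF-8 and the 9-byte limit falls on its middle bytes, so mb_strcut backs up to the previous character boundary and leaves the emoji out. A plain substr($str, 0, 9) would instead keep the first three bytes of the emoji, and that is what shows up as a broken, garbled string.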
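To make the byte-level behavior concrete, here is a minimal sketch that cuts the same sample string at 9 bytes with substr and with mb_strcut and then inspects the result. The sample string is an assumption chosen for illustration (the \u{1F60A} escape needs PHP 7.0 or later), and only built-in functions are used.

```php
<?php
// Assumed sample input: "\u{1F60A}" is a single emoji, 4 bytes in UTF-8.
$str = "Hello \u{1F60A} World!";

$raw  = substr($str, 0, 9);              // plain byte cut
$safe = mb_strcut($str, 0, 9, 'UTF-8');  // byte cut aligned to character boundaries
$full = mb_strcut($str, 0, 10, 'UTF-8'); // one more byte and the emoji fits

var_dump(mb_check_encoding($raw, 'UTF-8'));  // bool(false)
var_dump(mb_check_encoding($safe, 'UTF-8')); // bool(true)

echo bin2hex($raw), "\n";  // 48656c6c6f20f09f98 (ends with 3 of the emoji's 4 bytes)
echo bin2hex($safe), "\n"; // 48656c6c6f20       ("Hello " only)
echo strlen($full), "\n";  // 10
```

Comparing the hex dumps is a quick way to see where a cut really landed, since a terminal or browser may hide or replace broken bytes.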

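Why are emoji particularly troublesome?

An emoji usually takes 4 bytes in UTF-8, and compound emoji take far more (for example 👨‍👩‍👧‍👦, which is four emoji joined by zero-width joiners, 25 bytes in total). A byte-based cut can also land between the code points of a compound emoji, which leaves valid UTF-8 but changes the emoji that is displayed. If bytes are cut without respecting character boundaries at all, as substr does, you may run into the following: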

  • The output contains invalid byte sequences;

  • The browser displays garbled text or question marks;

  • The database may report an error (especially in strict mode);

  • JSON encoding may fail.
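The JSON failure in the list above is easy to reproduce. The following sketch assumes a string that was cut by raw bytes with substr, so that it ends in the middle of an emoji; the sample string is illustrative only.

```php
<?php
// Assumed sample: a raw byte cut that ends inside the 4-byte emoji.
$broken = substr("Hello \u{1F60A} World!", 0, 9);

// The string is no longer valid UTF-8.
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// json_encode() rejects malformed UTF-8 and returns false.
$json  = json_encode(['preview' => $broken]);
$error = json_last_error();
var_dump($json);                      // bool(false)
var_dump($error === JSON_ERROR_UTF8); // bool(true)

// Since PHP 7.2 this flag substitutes U+FFFD for the bad bytes instead of failing.
$substituted = json_encode(['preview' => $broken], JSON_INVALID_UTF8_SUBSTITUTE);
echo $substituted, "\n";
```

Substitution keeps the response alive, but the reader still sees a replacement character, so it is better treated as a fallback than as a fix.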

How to truncate strings containing emoji gracefully?

If your goal is to display a text preview that contains emoji (such as a summary of a Weibo post or a comment), consider the following methods:

Method 1: Use mb_substr instead of mb_strcut

If you don't mind truncating by characters rather than bytes, you can use mb_substr, which never cuts inside a character:

<code>
$str = "Hello 😊 World!";
// Take the first 7 characters: "Hello", a space, and the emoji
$preview = mb_substr($str, 0, 7, 'UTF-8');
echo $preview;
</code>

This outputs Hello 😊, that is, seven complete characters and no broken bytes.
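Two things are worth keeping in mind with this method: the byte length of the result is not fixed, and mb_substr counts code points, so an emoji built from several code points can still be taken apart. The sketch below illustrates both; the sample strings are assumptions made for the example.

```php
<?php
$str     = "Hello \u{1F60A} World!";
$preview = mb_substr($str, 0, 7, 'UTF-8');

echo mb_strlen($preview, 'UTF-8'), "\n"; // 7  (characters)
echo strlen($preview), "\n";             // 10 (bytes: 6 ASCII + one 4-byte emoji)

// A compound emoji: man + ZWJ + woman + ZWJ + girl (5 code points, 18 bytes).
$family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
echo mb_strlen($family, 'UTF-8'), "\n"; // 5
echo strlen($family), "\n";             // 18

// Taking "one character" keeps only the first code point of the sequence.
$first = mb_substr($family, 0, 1, 'UTF-8');
var_dump($first === "\u{1F468}"); // bool(true)
```

If the preview has to fit a byte budget (a database column, for instance), check strlen on the result rather than relying on the character count.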

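Method 2: Strip incomplete characters with a regular expression

If you insist on using mb_strcut (for example, to control the number of bytes), you can add a regular expression that removes an incomplete character left at the end after truncation. For valid UTF-8 input mb_strcut already stops at a character boundary, so this step is mainly a safeguard for strings that were cut by bytes elsewhere:

<code>
$str = "Hello 😊 World!";
$cut = mb_strcut($str, 0, 9, 'UTF-8');

// Remove a trailing lead byte that has fewer continuation bytes than it needs.
// Complete characters at the end are left untouched.
$pattern = '/(?:[\xC2-\xDF]|[\xE0-\xEF][\x80-\xBF]?|[\xF0-\xF4][\x80-\xBF]{0,2})\z/';
$clean = preg_replace($pattern, '', $cut);
echo $clean;
</code>

This code removes an incomplete multibyte character that may have been left at the end of the string. The pattern deliberately has no u modifier, so it matches raw bytes. A looser pattern such as [\xC0-\xFF][\x80-\xBF]*$ would also delete a complete emoji at the end of the string.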
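One way to gain confidence in a cleanup pattern is to run it against every possible cut position. The sketch below wraps a raw substr cut and a strict tail-stripping pattern in a helper. Note that truncateUtf8Bytes is a hypothetical name rather than a built-in function, and the helper assumes that the input itself is valid UTF-8.

```php
<?php
/**
 * Hypothetical helper: keep at most $maxBytes bytes of a valid UTF-8
 * string and drop a character that was cut in half at the end.
 */
function truncateUtf8Bytes(string $s, int $maxBytes): string
{
    $cut = substr($s, 0, max(0, $maxBytes)); // raw byte cut, may end mid-character

    // Match only a lead byte with too few continuation bytes at the very end.
    // No "u" modifier: the pattern has to see raw bytes.
    $incompleteTail = '/(?:[\xC2-\xDF]|[\xE0-\xEF][\x80-\xBF]?|[\xF0-\xF4][\x80-\xBF]{0,2})\z/';

    return preg_replace($incompleteTail, '', $cut);
}

// Assumed sample mixing 4-, 2- and 3-byte characters.
$str = "Hello \u{1F60A} W\u{00F6}rld \u{4F60}\u{597D}!";

$failures = [];
for ($limit = 0; $limit <= strlen($str); $limit++) {
    $out = truncateUtf8Bytes($str, $limit);
    $ok  = mb_check_encoding($out, 'UTF-8')   // still valid UTF-8
        && strlen($out) <= $limit             // within the byte budget
        && $limit - strlen($out) <= 3;        // lost at most one partial character
    if (!$ok) {
        $failures[] = $limit;
    }
}

echo count($failures), " failing cut positions\n";
echo truncateUtf8Bytes($str, 9), "|\n"; // prints "Hello |"
```

The loop is the useful part: a pattern that looks right on one example can still eat a complete character at some other offset, and checking every limit exposes that quickly.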

Method 3: Use IntlBreakIterator to determine the boundary (recommended method)

PHP's intl extension can detect the boundaries of user-perceived characters (grapheme clusters), which makes it suitable for complex multibyte characters such as compound emoji:

<code>
$str = "Hello 😊 World!";
$limit = 9; // maximum number of bytes to keep

$breakIterator = IntlBreakIterator::createCharacterInstance('en');
$breakIterator->setText($str);

// Boundaries are byte offsets into the UTF-8 string, so compare them
// with the byte limit directly and cut with substr, not mb_substr.
$end = 0;
foreach ($breakIterator as $boundary) {
    if ($boundary > $limit) {
        break;
    }
    $end = $boundary;
}

$preview = substr($str, 0, $end);
echo $preview;
</code>

This keeps the preview within the byte limit while cutting only between complete characters, which makes it suitable for internationalization projects or complex text processing.
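The same idea can be packaged as a reusable function and tried on a compound emoji. This is a sketch under assumptions: truncateGraphemeBytes is a hypothetical name, the intl extension must be loaded, and whether a zero-width-joiner sequence counts as a single grapheme cluster depends on the ICU version PHP was built against.

```php
<?php
/**
 * Hypothetical helper: longest prefix of $s that fits in $maxBytes bytes
 * and ends on a grapheme cluster boundary. Requires the intl extension.
 */
function truncateGraphemeBytes(string $s, int $maxBytes): string
{
    $it = IntlBreakIterator::createCharacterInstance('en');
    $it->setText($s);

    $end = 0;
    // first()/next() return byte offsets into the UTF-8 string.
    for ($b = $it->first(); $b !== IntlBreakIterator::DONE; $b = $it->next()) {
        if ($b > $maxBytes) {
            break;
        }
        $end = $b;
    }

    return substr($s, 0, $end);
}

if (extension_loaded('intl')) {
    // Man + ZWJ + woman + ZWJ + girl (18 bytes), then a space and a word.
    $family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467} family";

    // mb_strcut stops on a code point boundary, which may be inside the sequence.
    echo bin2hex(mb_strcut($family, 0, 10, 'UTF-8')), "\n";
    // The grapheme-aware cut only ever keeps whole clusters.
    echo bin2hex(truncateGraphemeBytes($family, 10)), "\n";

    echo truncateGraphemeBytes("Hello \u{1F60A} World!", 9), "|\n"; // prints "Hello |"
}
```

Comparing the two hex dumps on your own PHP build shows whether the installed ICU treats the sequence as one cluster; the single-emoji case behaves the same either way.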

Summary

When a string contains emoji or other multibyte characters, truncating it with mb_strcut requires special attention:

  • It cuts by bytes: a character that does not fit is dropped, and a compound emoji may be split between its code points;

  • Strings that were cut by raw bytes need their incomplete trailing characters cleaned up, for example with a regular expression;

  • Using mb_substr is safer, but it does not control the byte length precisely;

  • IntlBreakIterator is recommended for making sure the truncation position is valid.

Be sure to test emoji handling for integrity and compatibility in the user interface, database storage, API output, and so on, to avoid garbled text or data errors.

For more best practices about character processing, please refer to the documentation or visit https://gitbox.net/dev/mbstring .