How to accurately truncate content in HTML text using mb_strcut?

gitbox 2025-05-29

1. Basic usage of mb_strcut

mb_strcut cuts a string by bytes, whereas mb_substr counts characters. Because mb_strcut never splits a multibyte character, it is often used when a multibyte-encoded string has to fit within a byte limit.

Example:

<?php
$str = "这是一个测试字符串";  // Chinese for "This is a test string"
echo mb_strcut($str, 0, 6, "UTF-8");  // Output: 这是
?>

Here 6 is the number of bytes. Chinese characters generally occupy 3 bytes each in UTF-8, so cutting 6 bytes yields 2 Chinese characters.
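
To make the difference concrete, the following minimal sketch compares character-based and byte-based cutting on the same UTF-8 string. The sample string and variable names are illustrative assumptions, not part of the original example.

```php
<?php
$s = "你好世界"; // four Chinese characters, 3 bytes each in UTF-8 (12 bytes in total)

$byChars = mb_substr($s, 0, 2, "UTF-8"); // first 2 characters
$byBytes = mb_strcut($s, 0, 7, "UTF-8"); // at most 7 bytes, whole characters only
$raw     = substr($s, 0, 7);             // exactly 7 bytes, may split a character

echo $byChars, "\n";                        // 你好
echo $byBytes, "\n";                        // 你好 (a 7th byte would split the third character)
var_dump(mb_check_encoding($raw, "UTF-8")); // bool(false): the last character is broken
```

The byte limit passed to mb_strcut is an upper bound: the result may be shorter than requested when the limit falls inside a character.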

2. Why does cutting HTML code directly cause errors?

Suppose there is a piece of HTML code:

 <?php
$html = "<p>This is a<strong>test</strong>String。</p>";

If you cut it directly with mb_strcut, the cut may fall in the middle of a tag:

 $cut = mb_strcut($html, 0, 15, "UTF-8");
echo $cut;

This can output an incomplete tag. Here the 15-byte limit falls inside the opening <strong> tag, so the result is <p>This is a<st, which the browser cannot render correctly.
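
A quick way to see the problem is to check whether a cut fragment ends inside a tag. The sketch below uses a hypothetical helper, endsInsideTag(), which is not a PHP built-in; it assumes that attribute values do not contain a literal > character.

```php
<?php
// Hypothetical helper: returns true when $fragment ends in the middle of a tag,
// i.e. the last "<" has no ">" after it.
function endsInsideTag(string $fragment): bool
{
    $lastOpen = strrpos($fragment, '<');
    if ($lastOpen === false) {
        return false; // no tag at all
    }
    $lastClose = strrpos($fragment, '>');
    return $lastClose === false || $lastClose < $lastOpen;
}

$html = '<p>This is a<strong>test</strong> string.</p>';
$cut  = mb_strcut($html, 0, 15, 'UTF-8');

var_dump($cut);                // string(15) "<p>This is a<st"
var_dump(endsInsideTag($cut)); // bool(true)
```

A check like this only detects the damage. It does not repair it, which is why the byte limit is better applied to the text rather than to the markup.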

3. Solution ideas

First reduce the HTML to plain text and use mb_strcut to cut that text to the target byte length.
Then map the length of the cut text back onto the original HTML, keeping only the part that corresponds to it.
Finally, repair any truncated or unclosed tags so that the HTML structure stays complete.

The process is fairly involved. In practice it is usually handled with an existing library or, for simple cases, with regular expressions.
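
As a small illustration of the regular-expression route, the sketch below splits a fragment into tag and text segments and counts only the text bytes. The sample markup and the array layout are assumptions made for this example; the pattern treats anything between < and > as a tag, which is a simplification.

```php
<?php
$html = '<p>Hi <b>你好</b></p>';

// Each match is either a whole tag or a run of text between tags.
preg_match_all('/<[^>]+>|[^<]+/', $html, $m);

$segments = [];
foreach ($m[0] as $segment) {
    $isTag = $segment[0] === '<';
    $segments[] = [
        'type'  => $isTag ? 'tag' : 'text',
        'value' => $segment,
        'bytes' => $isTag ? 0 : strlen($segment), // tags do not count toward the limit
    ];
}

foreach ($segments as $s) {
    echo $s['type'], "\t", $s['value'], "\t", $s['bytes'], "\n";
}

$textBytes = array_sum(array_column($segments, 'bytes'));
echo $textBytes, "\n"; // 9 ("Hi " is 3 bytes, "你好" is 6 bytes)
```

Once the segments are separated this way, the byte budget can be spent on text segments only, while tags are copied through unchanged.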

4. Sample code implementation

The following sample code uses mb_strcut to cut the plain-text content while keeping the corresponding HTML markup:

<?php

function cutHtmlByTextLength($html, $length, $encoding = "UTF-8") {
    // 1. Use strip_tags to get the plain text; if it already fits, nothing needs cutting
    $text = strip_tags($html);
    if (strlen($text) <= $length) {
        return $html;
    }

    // 2. Initialize the state used to build the truncated HTML
    $result = '';
    $byteCount = 0;   // bytes of text emitted so far
    $tagStack = [];   // currently open tags, innermost last

    // Void elements never take a closing tag, so they must not be pushed onto the stack
    $voidTags = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'source', 'track', 'wbr'];

    // 3. Split the HTML into tag segments and text segments with a regular expression
    preg_match_all('/(<[^>]+>|[^<]+)/', $html, $matches);

    foreach ($matches[0] as $segment) {
        if ($byteCount >= $length) {
            break; // Limit reached; tags that are still open are closed in step 4
        }

        if ($segment[0] === '<') {
            // Tag segment: copy it through unchanged and keep the tag stack in sync
            $result .= $segment;

            // Decide whether this is a closing, self-closing, or opening tag
            if (preg_match('/^<\s*\/(\w+)/', $segment, $closeTag)) {
                // Closing tag: pop the most recent opening tag.
                // Mismatched tags are not repaired here; a stricter version could handle them.
                array_pop($tagStack);
            } elseif (preg_match('/^<\s*(\w+)[^>]*\/\s*>$/', $segment)) {
                // Self-closing tag: nothing to push
            } elseif (preg_match('/^<\s*(\w+)/', $segment, $openTag)) {
                // Opening tag: push it unless it is a void element such as <br> or <img>
                if (!in_array(strtolower($openTag[1]), $voidTags, true)) {
                    $tagStack[] = $openTag[1];
                }
            }
        } else {
            // Text segment: spend the remaining byte budget on it.
            // mb_strcut counts bytes in the string's own encoding, so strlen is the matching measure.
            $segmentBytes = strlen($segment);
            $remaining = $length - $byteCount;

            if ($segmentBytes <= $remaining) {
                $result .= $segment;
                $byteCount += $segmentBytes;
            } else {
                // Only part of this text fits; mb_strcut will not split a multibyte character
                $partial = mb_strcut($segment, 0, $remaining, $encoding);
                $result .= $partial;
                $byteCount += strlen($partial);
                break;
            }
        }
    }

    // 4. Close any tags that are still open so the HTML structure stays complete
    while (($tag = array_pop($tagStack)) !== null) {
        $result .= "</{$tag}>";
    }

    return $result;
}

// Example usage
$html = '<p>This is a<a href="https://gitbox.net/path/to/page">test链接</a>，Include<strong>Bold text</strong>And ordinary text。</p>';
$cutHtml = cutHtmlByTextLength($html, 30);
echo $cutHtml;
// Output: <p>This is a<a href="https://gitbox.net/path/to/page">test链接</a>，Include<strong>B</strong></p>

?>

The code above works as follows:

It first obtains the plain text with strip_tags.
It uses mb_strcut to cut text by bytes without splitting a multibyte character.
It splits the HTML into tag and text segments with a regular expression and reassembles them until the byte limit is reached.
It automatically closes any unclosed tags so that the resulting HTML stays well-formed.

5. Summary

mb_strcut is suited to cutting multibyte-encoded strings by byte length.
Cutting HTML code directly can easily leave incomplete tags.
The byte limit should be applied to the plain text, and the result then mapped back onto the HTML.
Unclosed tags must be closed to keep the structure intact.

For more complex HTML, it is recommended to combine a dedicated HTML parser (such as DOMDocument) with the text-cutting logic to ensure accuracy and safety.
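
As a rough sketch of the parser-based route, the function below loads a fragment into DOMDocument, spends the byte budget on text nodes, and removes everything after the cut. The name truncateHtmlWithDom(), the wrapper id cut-root, and the XML-declaration encoding hint are assumptions made for this illustration; it assumes UTF-8 input and the dom and mbstring extensions.

```php
<?php
// Hypothetical helper for illustration; not a PHP built-in.
function truncateHtmlWithDom(string $html, int $maxBytes): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup quietly
    // The leading XML declaration is a common hint that makes libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="cut-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="cut-root"]')->item(0);

    $used = 0; // bytes of text kept so far
    foreach ($xpath->query('.//text()', $root) as $textNode) {
        $bytes = strlen($textNode->data); // DOM strings are UTF-8 in PHP
        if ($used + $bytes <= $maxBytes) {
            $used += $bytes;
            continue;
        }
        // This node crosses the limit: trim it on a character boundary.
        $textNode->data = mb_strcut($textNode->data, 0, $maxBytes - $used, 'UTF-8');
        // Remove everything that follows the cut, level by level up to the wrapper.
        for ($node = $textNode; !$node->isSameNode($root); $node = $node->parentNode) {
            while ($node->nextSibling !== null) {
                $node->parentNode->removeChild($node->nextSibling);
            }
        }
        break;
    }

    // Serialize the wrapper's children; the parser supplies the closing tags.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$cut = truncateHtmlWithDom('<p>Hello <b>world</b> and more</p>', 8);
echo strip_tags($cut), "\n"; // Hello wo
```

Because the parser owns the tree, no tag stack is needed and closing tags are always emitted. The exact whitespace and entity encoding of the serialized markup depend on the libxml version, so the example checks only the text content.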
