Common problems and fixes when using get_meta_tags to get a page's title and keywords

gitbox 2025-09-16

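get_meta_tags() is a convenient built-in PHP function that extracts the content of <meta name="..."> tags from a remote or local HTML file. It is often used to grab a page's keywords or description. In practice, though, developers run into a range of problems: the title cannot be extracted, keywords come back empty, character encodings get garbled, remote requests fail, or meta tags are written in non-standard ways. This article summarizes the common problems and their causes, then gives fixes and more robust alternatives, with PHP example code you can copy.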


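1. How get_meta_tags() works and where it falls short (understand first, then debug)

  • get_meta_tags(string $filename, bool $use_include_path = false): reads the file and tries to parse <meta name="xxx" content="yyy">, returning an associative array of name => content with every name lowercased (or false on failure).

  • It does not return the content of the <title> tag (the page title), and it does not parse meta tags that lack a name attribute, such as <meta property="og:..."> or <meta charset="...">.

  • It is fairly strict about the HTML: each meta tag must have the form name="..." content="...", and attribute order or line breaks can sometimes affect parsing.

Conclusion: if you need the page's <title>, or the meta tags use property (as Open Graph does), get_meta_tags() alone is not enough.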
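
To make these limits concrete, here is a minimal sketch that runs get_meta_tags() against a small local file. The sample HTML and the temp-file prefix are made up for illustration; the result simply reflects the behavior described above.

```php
<?php
// Sample page (made up for this sketch): a <title>, two name-based meta tags,
// and one Open Graph tag that uses property instead of name.
$html = <<<HTML
<html><head>
<title>Demo page</title>
<meta name="Keywords" content="php, meta">
<meta name="description" content="A demo">
<meta property="og:title" content="OG Demo">
</head><body></body></html>
HTML;

// get_meta_tags() takes a filename or URL, not an HTML string, so write a temp file.
$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, $html);

$tags = get_meta_tags($file);
unlink($file);

// Only the name-based tags come back, with lowercased keys;
// neither the <title> nor the og:title tag is included.
var_export($tags);
```

The missing title and og:title entries are the gap that the DOMDocument examples in the following sections fill.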


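2. Common problems and fixes at a glance

Problem A: the <title> (page title) cannot be retrieved

Cause: get_meta_tags() does not parse <title>.
Fix: parse <title> with DOMDocument, or with a regular expression (not recommended). Example (DOMDocument, recommended):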

```php
function fetch_title($html) {
    $dom = new DOMDocument();
    // suppress warnings for malformed HTML
    @$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
    $nodes = $dom->getElementsByTagName('title');
    return $nodes->length ? trim($nodes->item(0)->textContent) : null;
}
```

If you need the content of a remote page, first download the HTML with file_get_contents or curl, then pass it to fetch_title().
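
For the file_get_contents route, a stream context lets you set a timeout and User-Agent before handing the HTML to a title parser. The sketch below is one way to do it; fetch_html_simple() and title_of() are hypothetical helper names, and the timeout and UA string are placeholder values. It is exercised against a local temp file so it runs without network access (the HTTP options are simply ignored for local paths).

```php
<?php
// Hypothetical helper: read a URL (or local path) with file_get_contents.
// Remote URLs require allow_url_fopen to be enabled.
function fetch_html_simple(string $source, int $timeout = 10): string {
    $context = stream_context_create([
        'http' => [
            'timeout'         => $timeout, // seconds
            'user_agent'      => 'Mozilla/5.0 (compatible; PHP script)',
            'follow_location' => 1,
            'max_redirects'   => 5,
        ],
    ]);
    $html = @file_get_contents($source, false, $context);
    if ($html === false) {
        throw new RuntimeException("Failed to read: $source");
    }
    return $html;
}

// Same DOMDocument approach as fetch_title(), inlined to keep the sketch self-contained.
function title_of(string $html): ?string {
    $dom = new DOMDocument();
    @$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
    $nodes = $dom->getElementsByTagName('title');
    return $nodes->length ? trim($nodes->item(0)->textContent) : null;
}

// Demo against a local file with made-up content.
$file = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($file, '<html><head><title> Hello </title></head><body></body></html>');
$title = title_of(fetch_html_simple($file));
unlink($file);
echo $title, "\n";
```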


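Problem B: get_meta_tags() returns an empty array or misses some meta tags

Possible causes

  1. The meta tag is not written as name="..." + content="..." (for example, it uses property="og:..." or http-equiv).

  2. The meta tag sits outside <head> (or the page structure is malformed); get_meta_tags() stops parsing at </head>.

  3. Character encoding or a BOM causes parsing to fail.

  4. allow_url_fopen is disabled, so a URL cannot be used.

Fixes

  • Check which attribute the meta tags use; where needed, use DOMDocument and inspect $meta->getAttribute('name') and $meta->getAttribute('property').

  • For remote URLs, prefer fetching the page with curl (more flexible), then parse it with the DOM.

  • If allow_url_fopen is disabled, switch to curl.

Example: extracting common meta tags (both name and property) and the title with curl + DOM: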

```php
function fetch_html($url, $timeout = 10) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 5,
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT => $timeout,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; PHP script)',
    ]);
    $html = curl_exec($ch);
    $err = curl_error($ch);
    curl_close($ch);
    if ($html === false) {
        throw new RuntimeException("Failed to fetch URL: $err");
    }
    return $html;
}

function parse_meta_and_title($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
    $result = ['title' => null, 'meta' => []];

    // title
    $titles = $dom->getElementsByTagName('title');
    if ($titles->length) {
        $result['title'] = trim($titles->item(0)->textContent);
    }

    // metas
    $metas = $dom->getElementsByTagName('meta');
    foreach ($metas as $meta) {
        $name = $meta->getAttribute('name');
        $prop = $meta->getAttribute('property'); // og:* and similar
        $http_equiv = $meta->getAttribute('http-equiv');
        $content = $meta->getAttribute('content');

        if ($name) {
            $result['meta'][strtolower($name)] = $content;
        } elseif ($prop) {
            $result['meta'][strtolower($prop)] = $content;
        } elseif ($http_equiv) {
            $result['meta'][strtolower($http_equiv)] = $content;
        }
    }
    return $result;
}
```
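
Cause 4 in the list above can be checked at runtime rather than guessed. The following sketch reports which transport is usable in the current environment; pick_fetch_method() and its three return values are hypothetical names chosen for this example.

```php
<?php
// Hypothetical helper: decide how remote HTML can be fetched in this environment.
// Returns 'curl', 'fopen', or 'none'.
function pick_fetch_method(): string {
    if (function_exists('curl_init')) {
        return 'curl';  // preferred: finer control over timeouts, redirects, headers
    }
    // ini_get() returns the setting as a string such as "1", "0", "On", or "".
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return 'fopen'; // URL wrappers are enabled, so get_meta_tags($url) can open URLs
    }
    return 'none';
}

echo pick_fetch_method(), "\n";
```

Preferring curl whenever it is available matches the recommendation above; a project that only needs get_meta_tags() on URLs could check the allow_url_fopen branch first instead.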

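Problem C: garbled characters with multibyte encodings (Chinese, etc.)

Causes

  • The page's encoding (such as UTF-8 or GBK) does not match the default behavior of DOMDocument::loadHTML.

  • The charset in the HTTP header disagrees with the charset in the page's meta tag.

Fixes

  • Before calling loadHTML(), convert the HTML to UTF-8 (if it is not already) and inject <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> into the head, so that DOMDocument recognizes the encoding more reliably.

  • Use mb_detect_encoding() to determine the encoding and convert to UTF-8.

Example: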

```php
function normalize_to_utf8($html) {
    // Use the charset declared in a <meta> tag if present; otherwise fall back to mb_detect_encoding
    $encoding = null;
    if (preg_match('/<meta[^>]+charset=["\']?\s*([a-zA-Z0-9\-_]+)/i', $html, $m)) {
        $encoding = strtoupper($m[1]);
    }
    if (!$encoding) {
        $encoding = mb_detect_encoding($html, ['UTF-8', 'GB2312', 'GBK', 'ISO-8859-1'], true);
    }
    if ($encoding && strtoupper($encoding) !== 'UTF-8') {
        $html = mb_convert_encoding($html, 'UTF-8', $encoding);
        // The bytes are now UTF-8, so rewrite any stale charset declaration to match
        $html = preg_replace('/(<meta[^>]+charset=["\']?\s*)[a-zA-Z0-9\-_]+/i', '${1}utf-8', $html);
    }
    // Make sure loadHTML treats the document as UTF-8
    if (stripos($html, '<meta http-equiv="Content-Type"') === false) {
        // (\s[^>]*)? keeps the pattern from matching <header>
        $html = preg_replace('/<head(\s[^>]*)?>/i', '<head$1><meta http-equiv="Content-Type" content="text/html; charset=utf-8">', $html, 1);
    }
    return $html;
}
```
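
When the HTTP header and the meta tag disagree, one option is to read the charset from the Content-Type header (with curl it is available through curl_getinfo($ch, CURLINFO_CONTENT_TYPE)) and convert only when the bytes are not already valid UTF-8. The sketch below illustrates that idea; charset_from_content_type() and to_utf8() are hypothetical helpers, and falling back to ISO-8859-1 when nothing is declared is an assumption, not a rule.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type header value.
function charset_from_content_type(string $contentType): ?string {
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9_\-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: convert to UTF-8 only when the bytes are not already valid UTF-8.
function to_utf8(string $html, ?string $declared): string {
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html; // already valid UTF-8, whatever the header claims
    }
    // Assumption for this sketch: fall back to ISO-8859-1 when no charset was declared.
    return mb_convert_encoding($html, 'UTF-8', $declared ?? 'ISO-8859-1');
}

// Demo: build GBK bytes from a UTF-8 string, then convert them back.
$utf8 = "<title>\u{4E2D}\u{6587}</title>";          // two Chinese characters
$gbk = mb_convert_encoding($utf8, 'GBK', 'UTF-8');  // simulate a GBK page body
$charset = charset_from_content_type('text/html; charset=gbk');
$fixed = to_utf8($gbk, $charset);

echo $fixed === $utf8 ? "round trip ok\n" : "mismatch\n";
```

Checking validity first avoids double-converting a page that is already UTF-8 but carries a wrong header.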

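Problem D: get_meta_tags() is sensitive to HTML comments and irregular formatting

Cause: the function relies on a simple internal parser, which can fail on line breaks, comments, or unusual characters inside the content attribute.
Fix: use DOMDocument, which tolerates errors better; or preprocess the head section (strip comments, flatten attributes onto one line) before calling get_meta_tags(). The latter is inelegant but can serve as a short-term patch.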
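
As an illustration of the DOMDocument route, the sketch below parses a deliberately messy head section. The sample HTML is made up; it uses libxml_use_internal_errors() rather than the @ operator to keep parser warnings quiet, which is a stylistic choice and not something the approach depends on.

```php
<?php
// Made-up head section: a commented-out meta tag, plus a meta tag whose
// attributes are split across lines with content placed before name.
$html = <<<HTML
<html><head>
<!-- <meta name="keywords" content="old, stale"> -->
<meta
    content="Line-broken attributes"
    name="Description">
</head><body></body></html>
HTML;

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$found = [];
foreach ($dom->getElementsByTagName('meta') as $meta) {
    $name = strtolower($meta->getAttribute('name'));
    if ($name !== '') {
        $found[$name] = $meta->getAttribute('content');
    }
}

// The commented-out tag is skipped; the multi-line tag is read normally.
var_export($found);
```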


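Problem E: remote fetches time out, hit anti-scraping measures, or return 403/429

Countermeasures

  • Set a common browser UA with CURLOPT_USERAGENT.

  • Set reasonable values for CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT.

  • Enable CURLOPT_FOLLOWLOCATION (note that some environments require it to be enabled explicitly).

  • If the site uses anti-scraping measures (CAPTCHAs, JS rendering, bot detection), consider:

    • Simple request-header disguising (while complying with the law and the site's robots rules).

    • A scraping tool that executes JS (such as a headless browser), though that is beyond native PHP.

  • Handle HTTP status codes and retry on failure (with exponential backoff), but avoid excessive requests.

Example: curl with headers: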

```php
// $ch is assumed to be a handle created earlier with curl_init($url)
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS => 5,
    CURLOPT_CONNECTTIMEOUT => 10,
    CURLOPT_TIMEOUT => 15,
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    CURLOPT_HTTPHEADER => [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.5',
    ],
]);
```
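
The retry-with-exponential-backoff item in the list above can be sketched as follows. fetch_with_retry() is a hypothetical helper; it takes the actual request as a callable so the retry logic can be tried without a network. Which status codes count as retryable is an assumption of this sketch (403 is treated as final here), and with curl the status would typically come from curl_getinfo($ch, CURLINFO_HTTP_CODE).

```php
<?php
// Hypothetical helper: call $attempt until it succeeds or retries run out.
// $attempt must return [int $status, string $body].
function fetch_with_retry(callable $attempt, int $maxTries = 4, int $baseDelayMs = 500): array {
    $retryable = [429, 500, 502, 503, 504]; // assumption: adjust for your targets
    for ($try = 1; $try <= $maxTries; $try++) {
        [$status, $body] = $attempt();
        if ($status >= 200 && $status < 300) {
            return [$status, $body];
        }
        if (!in_array($status, $retryable, true) || $try === $maxTries) {
            throw new RuntimeException("Giving up after $try attempt(s), last status $status");
        }
        // Exponential backoff: base, 2x base, 4x base, ...
        usleep($baseDelayMs * (2 ** ($try - 1)) * 1000);
    }
    throw new InvalidArgumentException('maxTries must be at least 1');
}

// Demo with a fake attempt that fails twice with 429 and then succeeds.
$calls = 0;
$fake = function () use (&$calls): array {
    $calls++;
    return $calls < 3 ? [429, ''] : [200, '<html></html>'];
};
[$status, $body] = fetch_with_retry($fake, 4, 1); // 1 ms base delay keeps the demo fast
echo "status=$status after $calls calls\n";
```

Capping the number of tries keeps the "avoid excessive requests" advice enforceable in code rather than by convention.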

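Problem F: get_meta_tags() only returns lowercase key names

This is by design: key names are converted to lowercase, and special characters in the name are replaced with '_'. If your logic depends on case-sensitive field names, take care to normalize the keys.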
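
A short sketch of how the keys come out in practice. The sample values are made up; the duplicate keywords tags are included because, per the PHP manual, only the last tag with a given name is returned.

```php
<?php
// Made-up sample: an uppercase name, a name containing a dot,
// and two tags that share the same name.
$html = <<<HTML
<html><head>
<meta name="AUTHOR" content="Jane">
<meta name="geo.position" content="49.33;-86.59">
<meta name="keywords" content="first">
<meta name="keywords" content="second">
</head><body></body></html>
HTML;

$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, $html);
$tags = get_meta_tags($file);
unlink($file);

// Keys are lowercased, "geo.position" becomes "geo_position",
// and the later keywords tag replaces the earlier one.
var_export($tags);
```

If both spellings of a key matter to you, or you need every duplicate, parse with DOMDocument instead and build the array yourself.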


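3. Recommended robust implementation (get title, keywords, description, and og tags in one place)

Below is a combined function: fetch the HTML with curl, normalize the encoding, then parse with the DOM and return the common fields plus the full list of meta tags. It relies on fetch_html(), normalize_to_utf8(), and parse_meta_and_title() from the earlier sections.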

```php
function fetch_page_info($url) {
    $html = fetch_html($url);
    $html = normalize_to_utf8($html);
    $data = parse_meta_and_title($html);

    // Normalize the common fields: title, keywords, description
    $info = [];
    $info['title'] = $data['title'] ?? null;
    $meta = $data['meta'] ?? [];

    $info['keywords'] = $meta['keywords'] ?? ($meta['og:site_name'] ?? null);
    $info['description'] = $meta['description'] ?? ($meta['og:description'] ?? null);

    // Return every meta tag for further use
    $info['meta_all'] = $meta;

    return $info;
}

// Usage example:
try {
    $url = 'https://example.com';
    $info = fetch_page_info($url);
    var_export($info);
} catch (Exception $e) {
    echo "Error: " . $e->getMessage();
}
```

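4. Performance and caching advice

  • When crawling pages in bulk, do not fetch the same URL live every time. Use a cache (Redis, Memcached, or a file cache) with a suitable expiry policy, for example 1 hour or 24 hours depending on how often the page changes.

  • Limit concurrency when fetching in parallel, so that the target site does not ban you and your own host is not overloaded.

  • For large sites, fetch the home page and important pages first rather than blindly crawling every link.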
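
The file-cache option from the first item can be sketched in a few lines. cached_fetch() is a hypothetical helper; the cache file naming (an md5 of the URL under the system temp directory) and the use of file modification time for expiry are assumptions of this sketch. The real fetcher is passed in as a callable, so any fetch function can be plugged in.

```php
<?php
// Hypothetical helper: return cached content for $url if it is younger than
// $ttl seconds; otherwise call $fetcher($url) and store the result.
function cached_fetch(string $url, int $ttl, callable $fetcher, ?string $cacheDir = null): string {
    $cacheDir = $cacheDir ?? sys_get_temp_dir();
    $path = $cacheDir . DIRECTORY_SEPARATOR . 'pagecache_' . md5($url) . '.html';

    if (is_file($path) && (time() - filemtime($path)) < $ttl) {
        return file_get_contents($path);      // cache hit
    }
    $html = $fetcher($url);                   // cache miss: do the real request
    file_put_contents($path, $html, LOCK_EX);
    return $html;
}

// Demo with a fake fetcher that counts how often it is called.
$hits = 0;
$fetcher = function (string $url) use (&$hits): string {
    $hits++;
    return "<html><head><title>$url</title></head></html>";
};
$url = 'https://example.com/?demo=' . uniqid(); // unique key so the demo starts cold
$first = cached_fetch($url, 3600, $fetcher);
$second = cached_fetch($url, 3600, $fetcher);   // served from the cache file
echo "fetcher called $hits time(s)\n";

// Clean up the demo cache file.
unlink(sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'pagecache_' . md5($url) . '.html');
```

For multi-server setups the same interface could sit in front of Redis or Memcached instead of the filesystem.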


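5. Additional tips and caveats

  • Inconsistent meta tag styles: many modern sites use og:title, twitter:title, and similar tags. Open Graph tags, which are written with property, fall outside what get_meta_tags() targets; the DOM approach captures every type in one pass.

  • Duplicate meta tags: if a page contains several meta tags with the same name (for instance for multiple languages or versions), your parsing logic has to decide whether to take the first, merge them, or keep them all.

  • HTML entities in meta values: take care to decode entities such as &amp; and &#123; (html_entity_decode()).

  • robots/meta-refresh: if you need to handle meta refresh (redirects) or robots noindex, check http-equiv and the relevant attributes specifically.

  • Respect robots.txt and the law: before crawling, check the target site's robots.txt and terms of service, respect privacy and copyright, and do not fetch restricted content.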
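
Two of the points above, duplicate tags and robots/meta-refresh handling, can be sketched together. The sample HTML is made up, and keeping every value per key in an array is only one of the possible policies mentioned above.

```php
<?php
// Made-up sample: duplicate description tags, a robots directive, and a meta refresh.
$html = <<<HTML
<html><head>
<meta name="description" content="English text">
<meta name="description" content="Deutscher Text">
<meta name="robots" content="NOINDEX, follow">
<meta http-equiv="refresh" content="5; url=https://example.com/new">
</head><body></body></html>
HTML;

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Keep every value per key instead of letting later tags overwrite earlier ones.
$all = [];
foreach ($dom->getElementsByTagName('meta') as $meta) {
    $key = '';
    foreach (['name', 'property', 'http-equiv'] as $attr) {
        $key = strtolower($meta->getAttribute($attr));
        if ($key !== '') {
            break;
        }
    }
    if ($key !== '') {
        $all[$key][] = $meta->getAttribute('content');
    }
}

// robots: look for a noindex directive, case-insensitively.
$directives = array_map('trim', explode(',', strtolower($all['robots'][0] ?? '')));
$noindex = in_array('noindex', $directives, true);

// meta refresh: content has the form "<seconds>; url=<target>".
$refreshUrl = null;
if (preg_match('/^\s*\d+\s*;\s*url\s*=\s*(.+)$/i', $all['refresh'][0] ?? '', $m)) {
    $refreshUrl = trim($m[1], " \t\"'");
}

var_export(['descriptions' => $all['description'], 'noindex' => $noindex, 'refresh' => $refreshUrl]);
```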


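6. Practical checklist (quick troubleshooting steps)

  1. Confirm whether you are after <meta name="keywords"> or <title> (they need different tools).

  2. For remote fetches: first get the page with curl and print the raw HTML to see exactly how the meta tags are written and which encoding is used.

  3. Check the charset; if it is not UTF-8, convert before parsing.

  4. If get_meta_tags() cannot extract the values, switch to DOMDocument and capture name, property, and http-equiv together.

  5. Handle HTTP errors, redirects, and anti-scraping measures (set the UA, timeouts, and retry policy appropriately).

  6. Cache important pages to avoid repeated requests.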


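7. Summary

  • get_meta_tags() is simple and easy to use, but it only suits standard, simple meta name="..." cases. It does not capture <title> or property-style meta tags.

  • For complex, non-standard, or non-UTF-8 pages, the combination of curl + DOMDocument is recommended: it is more flexible and more robust.

  • Encoding, failed remote requests, anti-scraping measures, and non-standard meta markup are the usual failure points; following the troubleshooting order above will locate and fix most problems.

  • For pages that need JS rendering (SPAs, dynamically loaded meta tags), you need a headless browser or a server-side rendering solution, which is beyond native PHP.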