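get_meta_tags() is a convenient built-in PHP function that extracts the content of <meta name="..."> tags from a remote or local HTML file. It is commonly used to grab a page's keywords or description. In practice, however, developers run into a range of problems: the title cannot be extracted, keywords come back empty, characters are garbled, remote requests fail, or the meta tags are written in non-standard ways. This article summarizes the common problems and their causes, then offers fixes and a more robust alternative, with copyable PHP sample code.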
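Signature: get_meta_tags(string $filename, bool $use_include_path = false). It reads the file and tries to parse <meta name="xxx" content="yyy"> tags, returning an associative array of name => content. Names are lowercased, and special characters such as "." are replaced with "_" (so geo.position becomes geo_position).
It does not return the <title> element (the page title), and it ignores meta tags that lack a name attribute, such as <meta property="og:..."> or <meta charset="...">.
It is relatively strict about the HTML: each meta must appear as name="..." plus content="...", and attribute order or line breaks can sometimes affect parsing.
Conclusion: if you need the page <title>, or the page uses property-based meta tags (such as Open Graph), get_meta_tags() alone is not enough.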
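To make these limits concrete, here is a minimal sketch that runs get_meta_tags() against a small sample page. The sample HTML and the temp-file handling are assumptions made for illustration; only the name-based tags should appear in the result.

```php
<?php
// Hypothetical sample page, written to a temp file because get_meta_tags() reads from a file or URL.
$sample = <<<HTML
<html><head>
<title>Demo page</title>
<meta name="Keywords" content="php, meta">
<meta name="geo.position" content="49.33;-86.59">
<meta property="og:title" content="OG Demo">
<meta charset="utf-8">
</head><body></body></html>
HTML;

$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($path, $sample);
$tags = get_meta_tags($path);
unlink($path);

// Keys are lowercased and "." becomes "_"; <title> and the property-based tag are absent.
var_export($tags);
```

The og:title entry and the title are missing from the output, which is why the DOM-based approach is needed for those fields.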
Cause of a missing title: get_meta_tags() does not parse <title>.
Fix: parse <title> with DOMDocument, or with a regular expression (not recommended). Example (DOMDocument, recommended):
```php
function fetch_title($html) {
    // loadHTML() rejects an empty string (ValueError on PHP 8), so bail out early
    if ($html === '') {
        return null;
    }
    $dom = new DOMDocument();
    // suppress warnings for malformed HTML
    @$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
    $nodes = $dom->getElementsByTagName('title');
    return $nodes->length ? trim($nodes->item(0)->textContent) : null;
}
```
To read a remote page, first download the HTML with file_get_contents() or curl, then pass it to fetch_title().
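As a rough sketch of the file_get_contents() route, the helper below loads HTML from a path or URL and reads the title. The name read_title(), the timeout, and the user agent string are hypothetical choices; remote URLs only work when allow_url_fopen is enabled, and the demo uses a local temp file so it runs without network access.

```php
<?php
// Hypothetical helper: load HTML from a path or URL via file_get_contents(), then read <title>.
function read_title(string $source, int $timeout = 10): ?string
{
    // The "http" context options also apply to https:// URLs and are ignored for local files.
    $context = stream_context_create([
        'http' => ['timeout' => $timeout, 'user_agent' => 'Mozilla/5.0 (compatible; PHP script)'],
    ]);
    $html = @file_get_contents($source, false, $context);
    if ($html === false || $html === '') {
        throw new RuntimeException("Could not read: $source");
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
    $nodes = $dom->getElementsByTagName('title');
    return $nodes->length ? trim($nodes->item(0)->textContent) : null;
}

// Local demo: write a tiny page to a temp file and read its title back.
$path = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($path, '<html><head><title> Hello </title></head><body></body></html>');
$title = read_title($path);
unlink($path);
echo $title, PHP_EOL;
```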
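Possible causes of empty keywords or description:
The meta tag is not written as name="..." plus content="..." (for example it uses property="og:..." or http-equiv).
The meta tag sits outside <head>, or the page structure is malformed. get_meta_tags() stops parsing at </head>.
The character encoding or a BOM breaks parsing.
allow_url_fopen is disabled, so URLs cannot be opened.
Fixes:
Check which attribute the meta tag uses; where necessary, inspect both $meta->getAttribute('name') and $meta->getAttribute('property') with DOMDocument.
For remote URLs, prefer curl for fetching the page (it is more flexible), then parse with DOM.
If allow_url_fopen is disabled, switch to curl.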
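Before choosing between the two routes, it can help to check what the runtime actually offers. The following is a small sketch; available_transports() is a hypothetical helper name, not a PHP built-in.

```php
<?php
// Hypothetical helper: report which fetch mechanisms this PHP runtime offers.
function available_transports(): array
{
    return [
        // ini_get() returns a string such as "1", "0", "On" or ""; normalize it to a bool.
        'url_fopen' => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
        // The curl extension is optional, so check for it before calling curl_init().
        'curl' => function_exists('curl_init'),
    ];
}

$t = available_transports();
if ($t['curl']) {
    echo "Use curl\n";
} elseif ($t['url_fopen']) {
    echo "Fall back to file_get_contents() or get_meta_tags() with a URL\n";
} else {
    echo "No HTTP transport available\n";
}
```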
Example: extracting common meta tags (both name and property) and the title with curl + DOM:
```php
function fetch_html($url, $timeout = 10) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 5,
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT => $timeout,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; PHP script)',
    ]);
    $html = curl_exec($ch);
    $err = curl_error($ch);
    curl_close($ch);
    if ($html === false) {
        throw new RuntimeException("Failed to fetch URL: $err");
    }
    return $html;
}

function parse_meta_and_title($html) {
    $result = ['title' => null, 'meta' => []];
    // loadHTML() rejects an empty string (ValueError on PHP 8)
    if ($html === '') {
        return $result;
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);

    // title
    $titles = $dom->getElementsByTagName('title');
    if ($titles->length) {
        $result['title'] = trim($titles->item(0)->textContent);
    }

    // metas: a later tag with the same key overwrites an earlier one
    $metas = $dom->getElementsByTagName('meta');
    foreach ($metas as $meta) {
        $name = $meta->getAttribute('name');
        $prop = $meta->getAttribute('property'); // og: and similar
        $http_equiv = $meta->getAttribute('http-equiv');
        $content = $meta->getAttribute('content');
        if ($name) {
            $result['meta'][strtolower($name)] = $content;
        } elseif ($prop) {
            $result['meta'][strtolower($prop)] = $content;
        } elseif ($http_equiv) {
            $result['meta'][strtolower($http_equiv)] = $content;
        }
    }
    return $result;
}
```
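Causes of garbled characters:
The page's encoding (for example UTF-8 or GBK) does not match what DOMDocument::loadHTML() assumes; when the document declares no charset, it typically falls back to ISO-8859-1.
The charset in the HTTP header and the charset in the page's meta tag disagree.
Fixes:
Before calling loadHTML(), convert the HTML to UTF-8 if it is not already, and inject <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> into the head so DOMDocument recognizes the encoding more reliably.
Use mb_detect_encoding() to guess the encoding and convert to UTF-8. Treat the result as a heuristic, not a guarantee.
Example: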
```php
function normalize_to_utf8($html) {
    // Prefer the charset declared in a meta tag; otherwise fall back to mb_detect_encoding()
    $encoding = null;
    if (preg_match('/<meta[^>]+charset=["\']?\s*([a-zA-Z0-9\-_]+)/i', $html, $m)) {
        $encoding = strtoupper($m[1]);
    }
    if (!$encoding) {
        $encoding = mb_detect_encoding($html, ['UTF-8', 'GB2312', 'GBK', 'ISO-8859-1'], true);
    }
    if ($encoding && strtoupper($encoding) !== 'UTF-8') {
        // Note: an encoding name unknown to mbstring raises a ValueError on PHP 8
        $html = mb_convert_encoding($html, 'UTF-8', $encoding);
        // Rewrite the old charset declaration so it no longer contradicts the converted bytes
        $html = preg_replace('/(<meta[^>]+charset=["\']?\s*)[a-zA-Z0-9\-_]+/i', '${1}utf-8', $html);
    }
    // Make sure loadHTML() treats the document as UTF-8
    if (stripos($html, '<meta http-equiv="Content-Type"') === false) {
        // (\s[^>]*)? keeps the pattern from matching <header>
        $html = preg_replace('/<head(\s[^>]*)?>/i', '<head$1><meta http-equiv="Content-Type" content="text/html; charset=utf-8">', $html, 1);
    }
    return $html;
}
```
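The causes listed earlier also mention a BOM and the HTTP header charset, which normalize_to_utf8() does not look at. The sketch below shows one possible way to collect those two hints. Both helper names are hypothetical, UTF-32 BOMs are not handled, and the precedence (BOM first, then header) is an assumption. With curl, the header value could come from curl_getinfo($ch, CURLINFO_CONTENT_TYPE).

```php
<?php
// Hypothetical helper: detect a byte order mark. Returns [encoding or null, html without the BOM].
function strip_bom(string $html): array
{
    $boms = [
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($html, $bom, strlen($bom)) === 0) {
            return [$encoding, substr($html, strlen($bom))];
        }
    }
    return [null, $html];
}

// Hypothetical helper: pull the charset out of a Content-Type header value.
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*["\']?([A-Za-z0-9._-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

[$bomEncoding, $body] = strip_bom("\xEF\xBB\xBF<html></html>");
$headerEncoding = charset_from_content_type('text/html; charset=gbk');
// Precedence used here: BOM first, then HTTP header (an assumption; adjust to your needs).
$encoding = $bomEncoding ?? $headerEncoding;
echo $encoding, ' ', $body, PHP_EOL;
```

A detected encoding could then be passed to mb_convert_encoding() before the meta-based detection runs.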
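Cause of failures on non-standard meta markup: the function relies on a simple internal parser, which can fail on line breaks, comments, or unusual characters nested inside the content attribute.
Fix: DOMDocument is more fault tolerant. Alternatively, preprocess the head of the HTML (strip comments, flatten attributes onto one line) before calling get_meta_tags(); this is inelegant but can serve as a short-term stopgap.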
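A short sketch of the DOMDocument route on deliberately messy markup follows. The sample HTML is made up for illustration: attributes are split across lines, content comes before name, and a commented-out tag precedes the real one.

```php
<?php
// Hypothetical messy markup: multi-line attributes, reversed order, and a commented-out tag.
$html = <<<HTML
<html><head>
<!-- <meta name="description" content="commented out"> -->
<meta
    content="Real description"
    name="Description"
>
</head><body></body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);

$found = [];
foreach ($dom->getElementsByTagName('meta') as $meta) {
    $name = strtolower($meta->getAttribute('name'));
    if ($name !== '') {
        $found[$name] = $meta->getAttribute('content');
    }
}
// Only the live tag is reported; the commented-out one is not an element.
var_export($found);
```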
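Countermeasures for failed or blocked remote requests:
Set a common browser user agent with CURLOPT_USERAGENT.
Set reasonable CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT values.
Follow redirects with CURLOPT_FOLLOWLOCATION (it is off by default and must be enabled explicitly).
If the site has anti-scraping measures (CAPTCHAs, JS rendering, bot detection), consider:
Simple request-header spoofing (while complying with the law and the site's robots rules).
A scraping tool that executes JS (such as a headless browser), though this goes beyond native PHP.
Handle HTTP status codes and retry on failure with exponential backoff, without sending excessive requests.
Example: curl with headers: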
```php
$ch = curl_init('https://example.com'); // placeholder URL
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS => 5,
    CURLOPT_CONNECTTIMEOUT => 10,
    CURLOPT_TIMEOUT => 15,
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    CURLOPT_HTTPHEADER => [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.5',
    ],
]);
```
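The retry-with-backoff advice from the list above could look roughly like this. retry_with_backoff() is a hypothetical helper, and the demo uses a fake operation that fails twice so it runs without any network access; in real use the callable would wrap something like fetch_html(), which throws RuntimeException on failure.

```php
<?php
// Hypothetical helper: call $operation until it succeeds or $maxAttempts is reached,
// doubling the wait between attempts (exponential backoff).
function retry_with_backoff(callable $operation, int $maxAttempts = 3, int $baseDelayMs = 500)
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation($attempt);
        } catch (RuntimeException $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // give up and surface the last error
            }
            // Delay grows as base, 2*base, 4*base, ...
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
}

// Demo with a fake operation that fails twice, then succeeds.
$calls = 0;
$result = retry_with_backoff(function (int $attempt) use (&$calls) {
    $calls++;
    if ($attempt < 3) {
        throw new RuntimeException("attempt $attempt failed");
    }
    return 'ok';
}, 5, 1);
echo "$result after $calls calls", PHP_EOL;
```

Keeping $maxAttempts small respects the warning about excessive requests.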
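Lowercased keys are by design: get_meta_tags() converts key names to lowercase. If your application depends on case-sensitive field names, take care to normalize the keys yourself.
The combined function below fetches the HTML with curl, normalizes the encoding, then parses with DOM and returns the common fields along with the full meta list.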
```php
function fetch_page_info($url) {
    $html = fetch_html($url);
    $html = normalize_to_utf8($html);
    $data = parse_meta_and_title($html);

    // Normalize the common fields: title, keywords, description
    $info = [];
    $info['title'] = $data['title'] ?? null;
    $meta = $data['meta'] ?? [];
    // og:site_name is only a loose stand-in when no keywords tag exists
    $info['keywords'] = $meta['keywords'] ?? ($meta['og:site_name'] ?? null);
    $info['description'] = $meta['description'] ?? ($meta['og:description'] ?? null);

    // Return every meta tag for further use
    $info['meta_all'] = $meta;
    return $info;
}

// Usage example:
try {
    $url = 'https://example.com';
    $info = fetch_page_info($url);
    var_export($info);
} catch (Exception $e) {
    echo "Error: " . $e->getMessage();
}
```
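When crawling many pages in bulk, do not fetch the same URL live every time. Use a cache (Redis, Memcached, or a file cache) with a suitable expiry, for example 1 hour or 24 hours depending on how often the page changes.
When fetching concurrently, cap the concurrency to avoid being banned by the target site or overloading your own host.
For large sites, fetch the home page and important pages first rather than blindly crawling every link.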
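The caching suggestion above can be sketched with a plain file cache. The cached() helper, the file naming scheme, and the JSON storage format are all assumptions for illustration; the data must be JSON-serializable, and a production setup might use Redis or Memcached instead.

```php
<?php
// Hypothetical helper: return a cached value if it is younger than $ttl seconds,
// otherwise call $producer, store its result, and return it.
function cached(string $key, int $ttl, callable $producer, ?string $dir = null)
{
    $dir = $dir ?? sys_get_temp_dir();
    $file = $dir . DIRECTORY_SEPARATOR . 'pageinfo_' . sha1($key) . '.json';

    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        $cached = json_decode(file_get_contents($file), true);
        if ($cached !== null) {
            return $cached;
        }
    }
    $value = $producer();
    // LOCK_EX avoids interleaved writes when several processes refresh the same key.
    file_put_contents($file, json_encode($value), LOCK_EX);
    return $value;
}

// Demo: the producer stands in for an expensive fetch and should run only once.
$runs = 0;
$producer = function () use (&$runs) {
    $runs++;
    return ['title' => 'Example Domain'];
};
$key = 'https://example.com/' . uniqid('', true); // unique key so the demo starts cold
$first = cached($key, 3600, $producer);
$second = cached($key, 3600, $producer);
echo $runs, PHP_EOL;
```

In real use the producer would be a closure that calls fetch_page_info($url).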
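Inconsistent meta conventions: many modern sites use Open Graph tags such as og:title, which are declared with a property attribute and so fall outside what get_meta_tags() reads, often alongside twitter:title. DOM can capture every type in one pass.
Duplicate meta tags: if a page contains several meta tags with the same name (possibly for multiple languages or versions), your parsing logic must decide whether to take the first, merge them, or keep them all.
HTML entities in meta content: decode entities such as &amp; and &#123; with html_entity_decode() when working with raw strings. DOMDocument's getAttribute() already returns decoded text, so do not decode twice.
robots / meta refresh: if you need to handle meta refresh (redirects) or robots noindex, check http-equiv and the relevant attributes explicitly.
Respect robots.txt and the law: check the target site's robots.txt and terms of service before crawling, respect privacy and copyright, and do not fetch restricted content.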
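Two of the notes above, duplicate tags and meta refresh, can be sketched as follows. Both helper names are hypothetical, and the refresh parser only covers the common "delay; url=target" shape; it is not a full implementation of browser behavior.

```php
<?php
// Hypothetical helper: keep every value per key so later duplicates do not overwrite earlier ones.
function collect_meta_values(string $html): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
    $all = [];
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        foreach (['name', 'property', 'http-equiv'] as $attr) {
            $key = strtolower(trim($meta->getAttribute($attr)));
            if ($key !== '') {
                $all[$key][] = $meta->getAttribute('content');
                break; // use the first identifying attribute only
            }
        }
    }
    return $all;
}

// Hypothetical helper: split a meta refresh value such as "5; url=/next" into delay and target.
function parse_meta_refresh(string $content): ?array
{
    if (!preg_match('/^\s*(\d+)\s*(?:[;,]\s*(?:url\s*=\s*)?["\']?([^"\']*)["\']?)?\s*$/i', $content, $m)) {
        return null;
    }
    $url = isset($m[2]) ? trim($m[2]) : '';
    return ['delay' => (int) $m[1], 'url' => $url !== '' ? $url : null];
}

$html = '<html><head>'
    . '<meta name="description" content="English text">'
    . '<meta name="description" content="Tom &amp; Jerry">'
    . '<meta http-equiv="refresh" content="5; url=https://example.com/new">'
    . '</head><body></body></html>';

$metas = collect_meta_values($html);
$refresh = parse_meta_refresh($metas['refresh'][0]);
var_export($metas['description']); // both values are kept; &amp; arrives already decoded
var_export($refresh);
```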
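Confirm whether you are after <meta name="keywords"> or <title> (they need different tools).
For remote fetching: first retrieve the page with curl and print the raw HTML to see exactly how the meta tags are written and encoded.
Check the charset; if it is not UTF-8, convert before parsing.
If get_meta_tags() cannot extract what you need, switch to DOMDocument and capture name, property, and http-equiv together.
Handle HTTP errors, redirects, and anti-scraping measures (set a suitable UA, timeouts, and retry strategy).
Cache important pages to avoid repeated requests.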
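For the "print the raw HTML" step, a quick way to see how a page writes its tags is to list them verbatim. The sketch below uses a regular expression for inspection only, not for parsing; dump_head_tags() is a hypothetical name and the sample input stands in for a real response body.

```php
<?php
// Hypothetical debugging helper: list raw <meta> and <title> markup exactly as the server sent it.
function dump_head_tags(string $html): array
{
    preg_match_all('/<meta\b[^>]*>|<title\b[^>]*>.*?<\/title>/is', $html, $matches);
    return $matches[0];
}

// Sample input; in practice $html would be the raw response body.
$html = "<html><head><title>Demo</title>\n<meta property=\"og:title\" content=\"Demo\">\n</head></html>";
foreach (dump_head_tags($html) as $tag) {
    echo $tag, PHP_EOL;
}
```

Seeing property= instead of name= in this listing is usually enough to explain an empty get_meta_tags() result.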
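get_meta_tags() is simple and easy to use, but it only suits standard, straightforward meta name="..." cases. It does not capture <title> or property-based meta tags.
For complex, non-standard, or non-UTF-8 pages, the combination of curl + DOMDocument is recommended: it is more flexible and more robust.
Encoding, failed remote requests, anti-scraping measures, and non-standard meta markup are the common failure points; working through the troubleshooting order above will locate and fix most problems.
Pages that need JS rendering (SPAs, dynamically loaded meta tags) require a headless browser or a server-side rendering solution, which is beyond the scope of native PHP.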