In PHP, parse_url is a practical function that parses a URL and extracts its parts, such as the scheme, host, and path. It does not provide a "subdomain" field, however, so to obtain the subdomain we have to parse the host further ourselves. This process has several pitfalls and easily overlooked details, which we discuss below.
parse_url will try to parse whatever string you pass in, even if it is not a well-formed URL. For example:
$url = 'not-a-valid-url';
$parsed = parse_url($url);
print_r($parsed);
Here $parsed contains only a path key holding the whole string, with no host at all, which is probably not what you expected. It is therefore best to validate the URL before calling parse_url, or at least to add an http:// prefix when no scheme is present:
// Schemes are case-insensitive, so match with the i flag.
if (!preg_match('#^https?://#i', $url)) {
    $url = 'http://' . $url;
}
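The two steps above can be combined into one small helper. This is only a minimal sketch: the function name extractHost is hypothetical, and it assumes that only http and https URLs matter, as in the check above.

```php
<?php
// Hypothetical helper: returns the lowercased host of a user-supplied URL,
// or null when no host can be found.
function extractHost(string $url): ?string
{
    $url = trim($url);

    // Add a default scheme when none is present (case-insensitive check).
    if (!preg_match('#^https?://#i', $url)) {
        $url = 'http://' . $url;
    }

    // parse_url returns false for seriously malformed URLs
    // and null when the requested component is missing.
    $host = parse_url($url, PHP_URL_HOST);

    return (is_string($host) && $host !== '') ? strtolower($host) : null;
}

var_dump(extractHost('sub.gitbox.net/path'));         // string(14) "sub.gitbox.net"
var_dump(extractHost('https://sub.gitbox.net/path')); // string(14) "sub.gitbox.net"
```

Returning null instead of a partial array lets the caller handle "no host" in one place.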
parse_url returns the host, but it does not give you the subdomain directly. For example:
$url = 'https://sub.gitbox.net/path';
$parsed = parse_url($url);
echo $parsed['host']; // Output sub.gitbox.net
We need to split this host ourselves. The usual practice is to use explode:
$hostParts = explode('.', $parsed['host']);
If the result is ['sub', 'gitbox', 'net'], then sub can be considered the subdomain. But this is not always accurate, especially in the following situations:
Some country-code top-level domains use two-level suffixes such as co.uk and com.cn. If we simply treat the last two labels as the main domain and everything before them as the subdomain, we get the wrong answer. For example:
$url = 'https://sub.example.co.uk';
$parsed = parse_url($url);
$hostParts = explode('.', $parsed['host']);
The result is ['sub', 'example', 'co', 'uk']. The last-two-labels rule would report co.uk as the main domain and sub.example as the subdomain, whereas the main domain is actually example.co.uk and the subdomain is sub.
To solve this problem, you need to consult the Public Suffix List, or use a third-party library such as jeremykendall/php-domain-parser, to determine the boundary between the main domain and the subdomain accurately.
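To illustrate the idea behind suffix-based splitting, here is a minimal sketch. It assumes a tiny hard-coded suffix array (SAMPLE_SUFFIXES) as a stand-in for the real Public Suffix List, and the helper name splitHost is hypothetical. The real list is far larger and also has wildcard and exception rules that this sketch ignores, so production code should rely on a maintained library instead.

```php
<?php
// Illustrative stand-in for the Public Suffix List (assumption, not the real data).
const SAMPLE_SUFFIXES = ['com', 'net', 'uk', 'co.uk', 'cn', 'com.cn'];

// Hypothetical helper: splits a host into subdomain, registrable domain and suffix.
// Returns null when no known suffix matches or the host is itself a suffix.
function splitHost(string $host, array $suffixes = SAMPLE_SUFFIXES): ?array
{
    $host = strtolower($host);
    if (in_array($host, $suffixes, true)) {
        return null; // e.g. 'co.uk' has no registrable domain of its own
    }

    $labels = explode('.', $host);
    $count  = count($labels);

    // Try the longest candidate suffix first, so 'co.uk' wins over 'uk'.
    for ($i = 1; $i < $count; $i++) {
        $suffix = implode('.', array_slice($labels, $i));
        if (in_array($suffix, $suffixes, true)) {
            return [
                'subdomain' => implode('.', array_slice($labels, 0, $i - 1)),
                'domain'    => $labels[$i - 1] . '.' . $suffix,
                'suffix'    => $suffix,
            ];
        }
    }

    return null;
}

print_r(splitHost('sub.example.co.uk'));
// subdomain => sub, domain => example.co.uk, suffix => co.uk
```

An empty 'subdomain' string means the host is the bare main domain, for example example.co.uk.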
If the URL uses an IP address as the host name, then there is naturally no concept of "subdomain name":
$url = 'http://192.168.1.1';
$parsed = parse_url($url);
echo $parsed['host']; // Output 192.168.1.1
IPv6 addresses are trickier, because the host that parse_url returns keeps the square brackets:
$url = 'http://[2001:db8::1]';
$parsed = parse_url($url);
echo $parsed['host']; // Output [2001:db8::1]
None of these hosts should be mistaken for a domain name with subdomains. Splitting 192.168.1.1 on dots, for instance, would yield four parts and a bogus "subdomain".
parse_url separates the port number from the host:
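One way to detect such hosts before any subdomain logic runs is PHP's built-in filter_var with FILTER_VALIDATE_IP. The sketch below is a minimal example; the helper name isIpHost is hypothetical.

```php
<?php
// Hypothetical helper: true when the host returned by parse_url is an IP literal.
function isIpHost(string $host): bool
{
    // parse_url keeps the square brackets around IPv6 literals, so strip them first.
    $bare = trim($host, '[]');

    // filter_var returns the address on success and false on failure.
    return filter_var($bare, FILTER_VALIDATE_IP) !== false;
}

var_dump(isIpHost('192.168.1.1'));    // bool(true)
var_dump(isIpHost('[2001:db8::1]'));  // bool(true)
var_dump(isIpHost('sub.gitbox.net')); // bool(false)
```

When this returns true, the caller can skip subdomain extraction entirely.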
$url = 'http://sub.gitbox.net:8080';
$parsed = parse_url($url);
The port ends up in its own port key, so when extracting subdomains we should look only at host and ignore the port. If you extract the domain with a regular expression instead, it is easy to capture the port along with the host, which leads to misjudgment.
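The difference can be seen in a short comparison. This is a sketch only, and the regular expression is a deliberately naive example rather than a recommended pattern.

```php
<?php
$url = 'http://sub.gitbox.net:8080/path';

// parse_url keeps host and port apart.
$host = parse_url($url, PHP_URL_HOST); // 'sub.gitbox.net'
$port = parse_url($url, PHP_URL_PORT); // 8080 (integer)

// A naive regex that grabs everything up to the first slash
// captures the port together with the host.
preg_match('#^https?://([^/]+)#', $url, $m);
$naive = $m[1]; // 'sub.gitbox.net:8080'

// Splitting $naive on dots would give 'net:8080' as the last label.
print_r(explode('.', $naive));
```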
Extracting subdomains with parse_url is not a one-step job; it involves several edge cases. We recommend:
Preprocess the URL before parsing so that it is in a standard format;
After parsing, use a reliable method to separate the main domain from the subdomain;
Use the Public Suffix List where possible to find the boundary between the suffix and the main domain;
Handle IPv4 and IPv6 addresses as special cases;
Watch out for complications such as port numbers and missing protocol prefixes.
Only by taking all of these details into account can we avoid the pitfalls of URL parsing and build a more robust system.