In PHP, parse_url is a commonly used function that breaks a URL into its components, such as scheme, host, path, and query. However, when the URL you pass in has no scheme (such as http:// or https://), parse_url may not parse it the way you expect, and the result can be incomplete or misleading. This article analyzes the cause of this problem in detail and shows how to avoid it.
The official definition of parse_url is:
mixed parse_url ( string $url [, int $component = -1 ] )
By default it parses a URL string into an associative array (it returns false for seriously malformed URLs). For example:
$url = "http://gitbox.net/path/to/resource?foo=bar#section";
print_r(parse_url($url));
Output:
Array
(
    [scheme] => http
    [host] => gitbox.net
    [path] => /path/to/resource
    [query] => foo=bar
    [fragment] => section
)
This looks perfect. The problem appears with URLs that have no scheme.
For example, let's remove http://:
$url = "gitbox.net/path/to/resource?foo=bar#section";
print_r(parse_url($url));
What is the result?
Array
(
    [path] => gitbox.net/path/to/resource
    [query] => foo=bar
    [fragment] => section
)
Here the host is not recognized at all: there is no host key, and the entire string gitbox.net/path/to/resource is treated as the path. This is the trap that many developers fall into.
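One way to detect this situation in code is to ask parse_url for the host alone, using the optional $component argument from the signature shown earlier. The following is a minimal sketch; the helper name hasHost is hypothetical and exists only for illustration.

```php
<?php
// Hypothetical helper: reports whether parse_url() could identify a host.
function hasHost(string $url): bool
{
    // With PHP_URL_HOST, parse_url() returns the host as a string,
    // null when the component is absent, or false for a seriously malformed URL.
    $host = parse_url($url, PHP_URL_HOST);

    return is_string($host) && $host !== '';
}

var_dump(hasHost('http://gitbox.net/path/to/resource')); // bool(true)
var_dump(hasHost('gitbox.net/path/to/resource'));        // bool(false)
```

A check like this can be used to decide whether an input needs a scheme added before it is parsed.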
parse_url broadly follows the generic URL structure described in RFC 3986, although the PHP documentation points out that it is not meant to validate URLs and may not give correct results for relative or invalid ones. That structure is:
scheme:[//[user:password@]host[:port]]path[?query][#fragment]
In this structure, the host can only appear in the authority part, which is introduced by // and normally follows scheme:. Without that prefix, parse_url treats everything up to the ? or # as the path and has no way to tell which part is the host.
Simply put, parse_url does not add a missing scheme and does not try to guess the host.
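To make the rule concrete, the sketch below runs the same address through parse_url in three forms. It relies on the host belonging to the authority part, so a scheme-relative URL that starts with // still yields a host even though it has no scheme. The array keys and variable names are illustrative only.

```php
<?php
$inputs = [
    'with scheme'     => 'http://gitbox.net/path/to/resource',
    'no scheme'       => 'gitbox.net/path/to/resource',
    'scheme-relative' => '//gitbox.net/path/to/resource',
];

$hosts = [];
foreach ($inputs as $label => $url) {
    $parts = parse_url($url);

    // 'host' is only present when the input contains an authority part ("//...").
    $hosts[$label] = $parts['host'] ?? null;

    printf(
        "%-16s host=%s path=%s\n",
        $label,
        $hosts[$label] ?? '(none)',
        $parts['path'] ?? '(none)'
    );
}
```

Only the second form loses its host. In that case the whole string, including gitbox.net, ends up in path.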
The most direct solution is to add the missing scheme before parsing:
$url = "gitbox.net/path/to/resource?foo=bar#section";
// If the URL does not start with http:// or https://, prepend http://
if (!preg_match('#^https?://#i', $url)) {
    $url = 'http://' . $url;
}
print_r(parse_url($url));
This way you can get the complete parsing results.
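If this check is needed in several places, it can be wrapped in a small function. The sketch below is one possible shape, not a standard API: the name parseUrlWithDefaultScheme and the $defaultScheme parameter are assumptions made for illustration. It generalizes the check slightly, leaving any existing scheme:// prefix (for example ftp://) untouched and completing scheme-relative input that starts with //.

```php
<?php
/**
 * Hypothetical helper: parse a URL, prepending a default scheme when none is present.
 *
 * @return array|false Same result as parse_url(); false for seriously malformed input.
 */
function parseUrlWithDefaultScheme(string $url, string $defaultScheme = 'http')
{
    $url = trim($url);

    if (strncmp($url, '//', 2) === 0) {
        // Scheme-relative input such as "//gitbox.net/path": add only "scheme:".
        $url = $defaultScheme . ':' . $url;
    } elseif (!preg_match('#^[a-z][a-z0-9+.\-]*://#i', $url)) {
        // No "scheme://" prefix at all: add the full default prefix.
        $url = $defaultScheme . '://' . $url;
    }

    return parse_url($url);
}

$parts = parseUrlWithDefaultScheme('gitbox.net/path/to/resource?foo=bar#section');
print_r($parts);
```

This sketch only recognizes schemes written with ://, so inputs such as mailto: addresses are outside its scope and would need separate handling.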
If you don't want to add a scheme, you can use a regular expression to extract the domain name first:
$url = "gitbox.net/path/to/resource?foo=bar#section";
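// Defaults in case the pattern does not match
$host = '';
$path = '';
if (preg_match('#^([a-z0-9\-\.]+)(/.*)?$#i', $url, $matches)) {
    $host = $matches[1];
    // Everything after the host, including any query string and fragment
    $path = $matches[2] ?? '';
}
echo "Host: $host\n";
echo "Path: $path\n";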
However, this method is fragile: the pattern above does not match a host followed by a port or directly by a query string, and the captured path still contains the query and fragment. The first solution is therefore recommended.
If the project has more complex URL parsing requirements, consider a third-party PHP library such as league/uri, which is designed for more robust, standards-oriented URI handling.
parse_url depends on the URL scheme to correctly identify the host.
For a URL without a scheme, the host is parsed as part of the path.
It is best to add the scheme to the URL before passing it to parse_url.
When encountering complex situations, consider using a professional URI parsing library.
That covers the root cause of, and the solutions for, parse_url returning incomplete results when it is given a URL without a scheme. I hope this saves you some detours when working with URLs.