parse_url is a built-in PHP function that splits a URL string into its components, such as scheme, host, port, path, query, and fragment. Because URL formats are diverse and the source of a URL is often outside the developer's control, thorough boundary testing of parse_url is essential. This article looks at the function from several angles, shows how to test its robustness systematically, and provides test cases for each.
$url = "https://gitbox.net:8080/path/to/resource?query=123#section";
$parts = parse_url($url);
print_r($parts);
The output is as follows:
Array
(
[scheme] => https
[host] => gitbox.net
[port] => 8080
[path] => /path/to/resource
[query] => query=123
[fragment] => section
)
For some common and well-structured URLs, the behavior of parse_url generally meets expectations. We can prepare the following test samples for verification:
$urls = [
"http://gitbox.net",
"https://gitbox.net/path",
"ftp://user:pass@gitbox.net:21/dir/file.txt",
"http://gitbox.net:8000/?q=test#frag",
"//gitbox.net/path", // scheme-relative
];
foreach ($urls as $url) {
echo "Testing: $url\n";
print_r(parse_url($url));
}
$urls = [
"gitbox.net", // no scheme
"/relative/path", // Relative path
"mailto:user@gitbox.net", // mailto scheme
"file:///C:/path.txt", // file protocol
"http:///path", // missing host
":123", // Ports only?
];
foreach ($urls as $url) {
echo "Testing: $url\n";
print_r(parse_url($url));
}
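In these tests, parse_url often still returns a result for strings that lack a scheme or host, but the result may be incomplete or may not match what the developer intended. For example, "gitbox.net" comes back as a path rather than a host, while "http:///path" makes parse_url return false.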
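One way to make these differences visible in a test run is to report which components were actually found. The following is a minimal sketch; describe_parse is a hypothetical helper name, not part of PHP.

```php
<?php
// Hypothetical helper: summarize which components parse_url() found.
function describe_parse(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false) {
        // Seriously malformed input: parse_url() gave up entirely.
        return 'unparseable';
    }
    // List the names of the components that are present.
    return implode(',', array_keys($parts));
}

echo describe_parse('gitbox.net'), "\n";      // path (no host detected)
echo describe_parse('/relative/path'), "\n";  // path
echo describe_parse('http:///path'), "\n";    // unparseable
```

Checking for a missing host key, rather than only for false, catches the inputs that parse without error but were not interpreted as intended.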
$urls = [
"http://", // scheme only
"http://:@:/", // empty username and password
"://gitbox.net", // missing scheme name
"http://gitbox.net:-80", // negative port
"http://git box.net", // illegal space in host
"\0http://gitbox.net", // leading null byte
];
foreach ($urls as $url) {
echo "Testing: $url\n";
print_r(parse_url($url));
}
parse_url may return false or an incomplete result for these strings, and it is not designed to validate the URL it is given. In actual development, parse_url should therefore be combined with filter_var($url, FILTER_VALIDATE_URL) to further verify that the URL is legitimate.
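A minimal sketch of that combination might look like the following. The helper name is_acceptable_url and the restriction to http and https are assumptions made for illustration; a real project would pick its own policy.

```php
<?php
// Hypothetical helper: accept only absolute http(s) URLs that have a host.
function is_acceptable_url(string $url): bool
{
    // FILTER_VALIDATE_URL rejects many strings that parse_url() tolerates.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return false;
    }
    // Assumed policy: only http and https are allowed.
    $scheme = strtolower($parts['scheme'] ?? '');
    return in_array($scheme, ['http', 'https'], true);
}

var_dump(is_acceptable_url('https://gitbox.net:8080/path?query=123')); // bool(true)
var_dump(is_acceptable_url('http://'));                                // bool(false)
var_dump(is_acceptable_url('gitbox.net'));                             // bool(false)
```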
$url = "https://gitbox.net:443/path?arg=value#frag";
$components = [PHP_URL_SCHEME, PHP_URL_HOST, PHP_URL_PORT, PHP_URL_PATH, PHP_URL_QUERY, PHP_URL_FRAGMENT];
foreach ($components as $component) {
var_dump(parse_url($url, $component));
}
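This approach suits scenarios where only one component of the URL is needed, and it avoids building the full array. With a component argument, parse_url returns a string (or an int for PHP_URL_PORT), or null if that component is not present in the URL.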
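The return values for absent components are worth covering in tests as well. The short sketch below reuses the gitbox.net sample host from this article; the variable names are illustrative.

```php
<?php
$url = 'https://gitbox.net/path';

// Components that are absent come back as null, not as an empty string.
$port  = parse_url($url, PHP_URL_PORT);
$query = parse_url($url, PHP_URL_QUERY);

// When a port is present, PHP_URL_PORT yields an int.
$explicit = parse_url('https://gitbox.net:443/path', PHP_URL_PORT);

var_dump($port, $query, $explicit); // NULL, NULL, int(443)
```

Because null and false mean different things here (absent versus unparseable), strict comparisons with === are the safer choice when checking these results.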
$urls = [
"http://gitbox.net/path",
"http://gitbox.net/search?q=test",
"http://gitbox.net/%E4%B8%AD%E6%96%87", // URL encoded
];
foreach ($urls as $url) {
echo "Testing: $url\n";
print_r(parse_url($url));
}
Note that parse_url does not decode percent-encoded sequences; developers can apply urldecode or rawurldecode to the returned components when needed.
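As a rough illustration, the path can be decoded with rawurldecode and the query string with parse_str, which decodes values while splitting them. The query string q=hello%20world is an added example, not one of the samples above.

```php
<?php
$url = 'http://gitbox.net/%E4%B8%AD%E6%96%87?q=hello%20world';
$parts = parse_url($url);

// The path is still percent-encoded at this point.
echo $parts['path'], "\n"; // /%E4%B8%AD%E6%96%87

// rawurldecode() leaves "+" alone, which is usually what a path needs.
$decodedPath = rawurldecode($parts['path']);
echo $decodedPath, "\n"; // /中文

// parse_str() splits the query and decodes each value.
parse_str($parts['query'] ?? '', $params);
echo $params['q'], "\n"; // hello world
```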
In real projects, different kinds of URLs (CDN links, third-party interfaces, user-submitted links, etc.) call for different validation and fault-tolerance strategies. For example:
For links in uploaded content, parse the URL with parse_url first, then validate it with filter_var and check the host against a domain whitelist;
For URLs assembled on the backend, strictly validate each component before concatenation to prevent risks such as XSS or SSRF.
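A domain whitelist check could be sketched as follows. The function name host_is_allowed and the host cdn.gitbox.net are hypothetical, chosen only for the example.

```php
<?php
// Hypothetical allowlist check; the host list below is an illustrative assumption.
function host_is_allowed(string $url, array $allowedHosts): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        // No host at all (relative path, malformed input, ...).
        return false;
    }
    // Compare case-insensitively and require an exact match, not a substring.
    return in_array(strtolower($host), $allowedHosts, true);
}

$allowed = ['gitbox.net', 'cdn.gitbox.net'];
var_dump(host_is_allowed('https://GitBox.net/path', $allowed));              // bool(true)
var_dump(host_is_allowed('https://gitbox.net.evil.example/path', $allowed)); // bool(false)
var_dump(host_is_allowed('/relative/path', $allowed));                       // bool(false)
```

An exact match on the parsed host avoids the common mistake of searching the raw URL string for the trusted domain, which the second call above would pass.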
parse_url is a powerful function, but it requires careful use. By systematically testing its behavior under various boundary conditions, we can better understand its characteristics and limitations and make URL handling in the system more fault-tolerant and secure. It is recommended to wrap the tests above in automated test scripts and to keep verifying compatibility and stability in actual projects.
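One possible shape for such a script is a table of inputs and expected results, shown here as a plain PHP sketch without a test framework. The selection of cases and the output format are assumptions; a project using PHPUnit or a similar tool would express the same table as a data provider.

```php
<?php
// Each input maps to the expected parse_url() result (false means parse failure).
$cases = [
    'http://gitbox.net' => ['scheme' => 'http', 'host' => 'gitbox.net'],
    '//gitbox.net/path' => ['host' => 'gitbox.net', 'path' => '/path'],
    'http://gitbox.net:8000/?q=test#frag' => [
        'scheme' => 'http', 'host' => 'gitbox.net', 'port' => 8000,
        'path' => '/', 'query' => 'q=test', 'fragment' => 'frag',
    ],
    'http:///path' => false,
    'http://' => false,
];

$failures = 0;
foreach ($cases as $input => $expected) {
    $actual = parse_url($input);
    // Arrays are compared by key/value pairs regardless of order;
    // the failure case is compared strictly against false.
    $ok = is_array($expected) ? ($actual == $expected) : ($actual === $expected);
    if (!$ok) {
        $failures++;
        echo "FAIL: $input\n";
    }
}
echo "$failures failure(s)\n";
```

Running the same table under each PHP version the project supports helps reveal behavior changes in parse_url between releases.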