When you build a scheduled crawler, controlling the request rate matters. Requests sent too frequently can get your IP blocked by the target server, while requests sent too rarely reduce crawling throughput. In PHP, time_nanosleep is a practical way to control the interval between requests, particularly when you need delays specified at the millisecond level or finer.
PHP provides several delay functions, including sleep() and usleep(). sleep() only accepts whole seconds, so it suits scenarios where coarse delays are acceptable. usleep() accepts microseconds (millionths of a second), but it returns nothing, so the caller cannot tell whether the delay was cut short. time_nanosleep accepts a delay in seconds plus nanoseconds and reports how much time was left if the sleep was interrupted. Like every sleep function, its real accuracy still depends on the operating system scheduler. Its signature is:
time_nanosleep(int $seconds, int $nanoseconds): array|bool
The function takes two parameters, seconds and nanoseconds. Both must be non-negative, and nanoseconds must be less than one billion. This lets you specify a delay with a resolution of one billionth of a second, although the delay actually achieved depends on the operating system and hardware. That is more than enough for crawler scripts that need to fine-tune their request intervals.
Suppose we want to crawl data periodically from https://gitbox.net/data-feed. To avoid putting pressure on the server, we wait 300 milliseconds (0.3 seconds) after each request. We can do it like this:
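Since crawler intervals are usually expressed in milliseconds, a small conversion helper can keep the nanoseconds argument inside its valid range. The following is a minimal sketch; the function name msToSleepArgs is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper (not a PHP built-in): split a millisecond delay into
// the (seconds, nanoseconds) pair that time_nanosleep() expects.
// The nanoseconds part always stays below 1,000,000,000.
function msToSleepArgs(int $milliseconds): array
{
    if ($milliseconds < 0) {
        throw new InvalidArgumentException('Delay must not be negative');
    }
    $seconds = intdiv($milliseconds, 1000);              // whole seconds
    $nanoseconds = ($milliseconds % 1000) * 1000000;     // remainder in ns
    return [$seconds, $nanoseconds];
}

// 1500 ms splits into 1 second and 500,000,000 nanoseconds.
[$s, $ns] = msToSleepArgs(1500);

// Sleep for roughly 300 ms.
[$sleepSeconds, $sleepNanoseconds] = msToSleepArgs(300);
time_nanosleep($sleepSeconds, $sleepNanoseconds);
```

Splitting the value this way avoids passing a nanoseconds argument of one billion or more when the delay reaches a full second.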
<?php
$targetUrl = "https://gitbox.net/data-feed";
$maxRequests = 10;

for ($i = 0; $i < $maxRequests; $i++) {
    $response = file_get_contents($targetUrl);
    if ($response === false) {
        echo "Request {$i} failed\n";
    } else {
        echo "Request {$i} succeeded, content length: " . strlen($response) . "\n";
    }

    // Sleep 300 milliseconds (0.3 seconds) after each request
    $seconds = 0;
    $nanoseconds = 300 * 1000000; // 300 ms = 300,000,000 ns
    time_nanosleep($seconds, $nanoseconds);
}
In this script, we use file_get_contents to fetch data from https://gitbox.net/data-feed and call time_nanosleep(0, 300000000) to pause for at least 300 milliseconds after each request. The time between the start of two consecutive requests is therefore the request duration plus the pause.
time_nanosleep returns true on success and false on failure. If the delay is interrupted by a signal, it instead returns an associative array whose seconds and nanoseconds keys hold the time remaining. We can add error handling or retry logic where needed:
$result = time_nanosleep(0, 300000000);
if (is_array($result)) {
    echo "Delay interrupted, time left: {$result['seconds']} seconds, {$result['nanoseconds']} nanoseconds\n";
}
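If the crawler should always wait out the full interval, the remaining time from the returned array can be fed back into another call. The sketch below shows one way to do that; the helper name sleepFully is hypothetical, and the 50 ms delay is only an illustrative value.

```php
<?php
// Hypothetical helper: keep sleeping until the full delay has elapsed,
// resuming with the remaining time whenever a signal interrupts the sleep.
function sleepFully(int $seconds, int $nanoseconds): bool
{
    while (true) {
        $result = time_nanosleep($seconds, $nanoseconds);
        if ($result === true) {
            return true;   // the full delay elapsed
        }
        if ($result === false) {
            return false;  // the call itself failed
        }
        // Interrupted: continue with whatever time is left.
        $seconds = $result['seconds'];
        $nanoseconds = $result['nanoseconds'];
    }
}

// Measure a 50 ms sleep with the monotonic clock (hrtime requires PHP 7.3+).
$start = hrtime(true);
$ok = sleepFully(0, 50000000);
$elapsedMs = (hrtime(true) - $start) / 1e6;
echo "Slept for about " . round($elapsedMs) . " ms\n";
```

Looping on the remaining time keeps the request interval from silently shrinking when the process receives signals, for example in a long-running worker.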
Avoid being blocked: Use time_nanosleep to limit the request rate and send a browser-like User-Agent header, which helps reduce the risk of the target server identifying the script as a crawler.
Dynamic interval control: Adjust the time_nanosleep parameters at run time according to the site's response time or server load to balance crawler efficiency and stability.
Use curl instead of file_get_contents: In real projects, curl offers better error handling, timeout control, and request configuration, so it is the recommended first choice.
Used sensibly, time_nanosleep makes the pacing of a PHP crawler more predictable and easier to control. It is especially useful when the request rate has to be tuned at the millisecond level, where it can become a core part of your scheduling strategy. Combined with solid error handling and a considerate access strategy, it helps you build a crawler that is both efficient and robust.
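The three suggestions above can be combined in a short sketch. Everything named here is an assumption for illustration: the helpers fetchWithCurl and nextDelayMs are hypothetical, the User-Agent string is a placeholder, and the "wait twice the response time" policy is just one possible rule, not a recommendation from any particular site.

```php
<?php
// Hypothetical helper: fetch a URL with cURL, sending a User-Agent header
// and enforcing timeouts. Returns the body (null on failure), the elapsed
// time in seconds, and an error message if the request failed.
function fetchWithCurl(string $url, string $userAgent): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_CONNECTTIMEOUT => 5,      // seconds allowed to connect
        CURLOPT_TIMEOUT        => 10,     // seconds allowed for the whole request
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    $start = microtime(true);
    $body = curl_exec($ch);
    $elapsed = microtime(true) - $start;
    $error = $body === false ? curl_error($ch) : null;
    return [
        'body'    => $body === false ? null : $body,
        'seconds' => $elapsed,
        'error'   => $error,
    ];
}

// Hypothetical policy: wait twice as long as the server took to respond,
// clamped to the range [$minMs, $maxMs].
function nextDelayMs(float $responseSeconds, int $minMs = 300, int $maxMs = 5000): int
{
    $delay = (int) round($responseSeconds * 2 * 1000);
    return max($minMs, min($maxMs, $delay));
}

// Example usage (performs a real network request, so it is left commented out):
// $r = fetchWithCurl('https://gitbox.net/data-feed', 'Mozilla/5.0 (compatible; ExampleCrawler/1.0)');
// $ms = nextDelayMs($r['seconds']);
// time_nanosleep(intdiv($ms, 1000), ($ms % 1000) * 1000000);
```

Tying the pause to the measured response time means the crawler backs off automatically when the server slows down, while the lower bound keeps the original 300 ms spacing in place when responses are fast.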