How to use PHP's time_nanosleep function in a timed crawler to accurately control the crawling frequency?

gitbox 2025-05-20

When developing scheduled crawler tasks, controlling the request frequency is crucial. Requesting too often may get your IP blocked by the target server, while requesting too rarely slows down data collection. In PHP, time_nanosleep is a practical function for setting the pause between requests with fine granularity, which makes it well suited to millisecond-level delays.

Why choose time_nanosleep

PHP provides several delay functions, such as sleep() and usleep(). sleep() works in whole seconds, so it only suits scenarios where coarse timing is acceptable. usleep() accepts microseconds (millionths of a second), but it returns nothing, so a script cannot tell whether the pause was cut short. time_nanosleep takes the delay as separate seconds and nanoseconds values and reports the remaining time if the sleep is interrupted. Note that every one of these functions depends on operating system scheduling, so the actual pause can be slightly longer than requested. The signature is:

 time_nanosleep(int $seconds, int $nanoseconds): array|bool

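This function takes two parameters: seconds, a non-negative integer, and nanoseconds, a non-negative integer below one billion. The delay can therefore be specified in steps of one billionth of a second, although the precision actually achieved depends on the operating system and hardware. That granularity is more than enough for crawler scripts that need to fine-tune request intervals.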
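Because the nanoseconds argument must stay below one billion, a delay of a second or more has to be split across both parameters. A minimal sketch of that conversion follows; splitMilliseconds is a hypothetical helper name used for illustration, not a PHP built-in.

```php
<?php

/**
 * Split a delay given in milliseconds into the [seconds, nanoseconds]
 * pair that time_nanosleep() expects.
 * Hypothetical helper for illustration, not part of PHP.
 */
function splitMilliseconds(int $milliseconds): array
{
    if ($milliseconds < 0) {
        throw new InvalidArgumentException('Delay must not be negative.');
    }

    $seconds = intdiv($milliseconds, 1000);
    // The remainder is below 1000 ms, so this stays below one billion.
    $nanoseconds = ($milliseconds % 1000) * 1000000;

    return [$seconds, $nanoseconds];
}

// 1300 ms would become [1, 300000000]; here we pause for 250 ms.
[$seconds, $nanoseconds] = splitMilliseconds(250);
time_nanosleep($seconds, $nanoseconds);
```

Keeping the conversion in one place avoids hand-counting zeros each time the interval changes.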

Example scenario: crawl a page with a 300-millisecond pause between requests

Suppose we want to crawl data from https://gitbox.net/data-feed at regular intervals. To avoid putting pressure on the server, we pause for 300 milliseconds (0.3 seconds) after each request. We can do this:

 <?php

$targetUrl = "https://gitbox.net/data-feed";
$maxRequests = 10;

for ($i = 1; $i <= $maxRequests; $i++) {
    // file_get_contents() returns false when the request fails.
    $response = file_get_contents($targetUrl);

    if ($response === false) {
        echo "Request {$i} failed\n";
    } else {
        echo "Request {$i} succeeded, content length: " . strlen($response) . "\n";
    }

    // Sleep for 300 milliseconds (0.3 seconds) after each request.
    $seconds = 0;
    $nanoseconds = 300 * 1000000; // 300 milliseconds = 300,000,000 nanoseconds
    time_nanosleep($seconds, $nanoseconds);
}

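In this script, we use file_get_contents to fetch data from https://gitbox.net/data-feed and call time_nanosleep(0, 300000000) to pause for at least 300 milliseconds after each request. The pause comes on top of the time the request itself takes, so the gap between the starts of two consecutive requests is somewhat longer than 300 milliseconds.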
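If the goal is a steady interval between request starts rather than a fixed pause after each request, one option is to measure how long the request took and sleep only for the remainder. The sketch below assumes PHP 7.3 or later on a 64-bit build, where hrtime(true) returns an integer nanosecond reading. sleepRemainder is a hypothetical helper, and usleep() stands in for the HTTP request so the example runs without network access.

```php
<?php

/**
 * Sleep for whatever is left of $intervalNs, counted from $startedAtNs
 * (an hrtime(true) reading). Returns the number of nanoseconds requested
 * from time_nanosleep(), or 0 if the interval had already elapsed.
 * Hypothetical helper for illustration.
 */
function sleepRemainder(int $startedAtNs, int $intervalNs): int
{
    $elapsedNs = hrtime(true) - $startedAtNs;
    $remainingNs = $intervalNs - $elapsedNs;

    if ($remainingNs <= 0) {
        return 0; // the work already took longer than the interval
    }

    // Split into whole seconds and the sub-second nanosecond part.
    time_nanosleep(intdiv($remainingNs, 1000000000), $remainingNs % 1000000000);

    return $remainingNs;
}

$intervalNs = 300 * 1000000; // target: 300 ms between request starts

for ($i = 1; $i <= 3; $i++) {
    $startedAtNs = hrtime(true);

    usleep(50000); // stand-in for the HTTP request (about 50 ms of work)

    sleepRemainder($startedAtNs, $intervalNs);
}
```

When a request takes longer than the interval, this version skips the pause entirely. Whether that is acceptable depends on how strictly the target server should be protected from back-to-back requests.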

Error handling suggestions

time_nanosleep returns true on success and false on failure. If the delay is interrupted by a signal, it returns an array containing the remaining time under the keys seconds and nanoseconds. We can add error handling or retry logic if necessary:

 $result = time_nanosleep(0, 300000000);
if (is_array($result)) {
    // The sleep was interrupted; report how much time was left.
    echo "Delay interrupted, time left: {$result['seconds']} seconds, {$result['nanoseconds']} nanoseconds\n";
}
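The retry logic mentioned above could look like the following sketch, which keeps sleeping for the reported remainder until the full delay has passed. sleepFully is a hypothetical helper name, not a PHP built-in.

```php
<?php

/**
 * Sleep for the full duration, resuming with the remaining time whenever
 * time_nanosleep() reports an interruption.
 * Returns true once the delay has completed, false if a call fails.
 * Hypothetical helper for illustration.
 */
function sleepFully(int $seconds, int $nanoseconds): bool
{
    $result = time_nanosleep($seconds, $nanoseconds);

    while (is_array($result)) {
        // Interrupted by a signal: sleep again for what is left.
        $result = time_nanosleep($result['seconds'], $result['nanoseconds']);
    }

    return $result;
}

if (!sleepFully(0, 300000000)) {
    echo "Sleep failed\n";
}
```

In a plain command-line crawler interruptions are rare, so this mainly matters when the script installs signal handlers, for example for graceful shutdown.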

Practical advice

Avoid being blocked: Use time_nanosleep to limit the request frequency and send a browser-like User-Agent header, which helps reduce the risk of the target server identifying and blocking the crawler.
Dynamic interval control: Adjust the time_nanosleep parameters according to the website's response time or server load to improve crawler efficiency and stability.
Use curl instead of file_get_contents: In real projects, curl offers better error handling, timeout control, and request configuration, so it should be the first choice.
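The three suggestions can be combined roughly as sketched below. Everything specific here is an assumption made for illustration: the function names fetchPage and nextDelayMs, the User-Agent string, the timeout values, and the rule of waiting twice the last response time. The sketch also requires the curl extension to be enabled.

```php
<?php

/**
 * Fetch a URL with cURL.
 * Returns [body or null, error message, total request time in seconds].
 * The User-Agent string and timeouts are illustrative assumptions.
 */
function fetchPage(string $url): array
{
    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 5,    // seconds allowed for connecting
        CURLOPT_TIMEOUT => 10,          // seconds allowed for the whole request
        CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; ExampleCrawler/1.0)',
    ]);

    $body = curl_exec($handle);
    $error = curl_error($handle);
    $took = (float) curl_getinfo($handle, CURLINFO_TOTAL_TIME);

    return [$body === false ? null : $body, $error, $took];
}

/**
 * Choose the next delay in milliseconds: never below $baseMs, never above
 * $maxMs, and otherwise twice the last response time so that a slow
 * server gets more breathing room.
 */
function nextDelayMs(float $responseSeconds, int $baseMs = 300, int $maxMs = 5000): int
{
    $delay = max($baseMs, (int) round($responseSeconds * 1000 * 2));

    return min($delay, $maxMs);
}
```

A crawl loop would call fetchPage($targetUrl), pass the measured time to nextDelayMs, and hand the result to time_nanosleep as intdiv($delayMs, 1000) seconds and ($delayMs % 1000) * 1000000 nanoseconds.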

Conclusion

Used sensibly, time_nanosleep helps a PHP crawler keep a steady, predictable request rate. It is especially useful when the request frequency has to be controlled at the millisecond level, where it can become a valuable part of your scheduling strategy. Combined with solid error handling and a considerate access strategy, it helps us build a crawler that is both efficient and robust.