With the rapid development of the internet, the amount of information on web pages has grown exponentially, and efficiently and accurately capturing the data you need has become a key challenge for developers. PHP, as a popular web development language, combined with the powerful phpSpider crawler framework, provides great convenience for data extraction.
This article will guide you step-by-step on how to quickly build a crawler and extract target web page content using PHP and phpSpider.
First, you need to install phpSpider, a high-performance PHP-based crawler framework. Install it with Composer by running:
composer require php-spider/phpspider
Create a file named spider.php and include phpSpider's autoload file:
<?php
require 'vendor/autoload.php';

use phpspider\core\phpspider;

// Create a crawler instance
$spider = new phpspider();

// Set the starting URL
$spider->add_start_url('http://www.example.com');

// Define the page extraction callback
$spider->on_extract_page = function($page, $data) {
    // Write extraction logic here
    return $data;
};

// Start crawling
$spider->start();
The code above initializes the crawler, sets the starting URL, and defines a callback for processing extracted page data.
Within the callback, use regular expressions, XPath, or CSS selectors to locate elements. For example, extract the page title and main content as follows:
$spider->on_extract_page = function($page, $data) {
    // $page['raw'] holds the raw HTML of the fetched page
    // Extract the <title> tag with a regular expression
    preg_match('/<title>(.*?)<\/title>/is', $page['raw'], $matches);
    $data['title'] = isset($matches[1]) ? trim($matches[1]) : '';
    // Keep only the text content of the page
    $data['content'] = strip_tags($page['raw']);
    return $data;
};
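If you prefer XPath over regular expressions, PHP's built-in DOM extension can be used inside the same callback. The sketch below assumes, as above, that $page['raw'] holds the raw HTML; the query for a div with class "article-body" is only an illustration and should be adjusted to the structure of the target site:
$spider->on_extract_page = function($page, $data) {
    $dom = new DOMDocument();
    // Suppress warnings caused by imperfect real-world HTML
    @$dom->loadHTML($page['raw']);
    $xpath = new DOMXPath($dom);

    // Page title via XPath
    $titleNodes = $xpath->query('//title');
    $data['title'] = $titleNodes->length ? trim($titleNodes->item(0)->textContent) : '';

    // Hypothetical main-content container; adjust the query to the target site
    $bodyNodes = $xpath->query('//div[@class="article-body"]');
    $data['content'] = $bodyNodes->length
        ? trim($bodyNodes->item(0)->textContent)
        : strip_tags($page['raw']);

    return $data;
};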
Save the crawled data to a local file for further use:
$spider->on_extract_page = function($page, $data) {
    // Same extraction logic as above
    preg_match('/<title>(.*?)<\/title>/is', $page['raw'], $matches);
    $data['title'] = isset($matches[1]) ? trim($matches[1]) : '';
    $data['content'] = strip_tags($page['raw']);
    // Append the extracted record to a text file
    file_put_contents('extracted_data.txt', var_export($data, true) . PHP_EOL, FILE_APPEND);
    return $data;
};
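If you plan to process the results programmatically, a sketch of an alternative is to write one JSON record per line instead of var_export output; the filename extracted_data.jsonl is just an example:
$spider->on_extract_page = function($page, $data) {
    preg_match('/<title>(.*?)<\/title>/is', $page['raw'], $matches);
    $data['title'] = isset($matches[1]) ? trim($matches[1]) : '';
    $data['content'] = strip_tags($page['raw']);
    // One JSON object per line; LOCK_EX prevents interleaved writes
    file_put_contents(
        'extracted_data.jsonl',
        json_encode($data, JSON_UNESCAPED_UNICODE) . PHP_EOL,
        FILE_APPEND | LOCK_EX
    );
    return $data;
};
Each line of the resulting file can later be read back with json_decode for further processing.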
After saving the code, execute in the terminal:
php spider.php
The crawler will start fetching data and extracting information based on your rules, saving the results accordingly.
Using PHP combined with the phpSpider framework, you can quickly build a powerful web crawler that automates data extraction. This article covered the core processes of installation, coding, data extraction, and result saving, making it easy for developers to get started. More advanced features can be configured flexibly depending on project needs to improve crawling efficiency and data quality.