Web scraping is a hot topic right now, and PHP is a reasonably fast language that is convenient for rapid prototyping and wildly popular among web developers. I have extensive experience building complex scrapers in Node.js, but before that I spent many years building big projects in PHP (many of them are still alive, have proved themselves in the long term, and keep evolving). Hopefully, this simple tutorial will be useful for you!

JS and non-JS scraping

It is important to understand the difference between scraping with a full-fledged browser (such as headless Chrome) and lower-level scraping techniques, where raw server responses are analyzed.

PHP is not a good fit if you need JS execution and human-like events on a website (clicking, or scrolling to the bottom of the page to trigger loading of new content). Libraries like Puppeteer and Playwright are a much better fit for this, but then you typically end up writing the scraper in Node.js. This approach is very flexible and lets you scrape almost any website, but scraping with a real browser has major cons as well: it is a much slower process, the scraper is less stable overall, and it is often an order of magnitude more resource hungry.

To build a "lower-level" PHP scraper, you need to dig into Chrome DevTools and see exactly which network requests are sent to the website and which responses come back. If you manage to reproduce them, chances are your scraper will work much faster. This "non-JS" approach still works for a huge number of modern websites, and in most cases it is far faster than running a real browser behind the curtains.

Scraping in PHP

Web scraping usually consists of two stages: retrieving the response from an external website, and parsing that response. Higher-level libraries can also help orchestrate bulk scraping tasks, such as scraping a whole website or executing parallel requests (I won't cover that topic in this article).

Retrieving the response in PHP

Real-world web scraping is hardly possible without proxies and retries. A proxy protects your server's IP address from bans, and retries let you deal with the unreliable nature of HTTP connections: a proxy might go down, or the target website might return a 502 or 429 (too many requests) error. It is also important to set proper request headers so that the target website treats you as a real browser (it is not always possible to emulate all real browser features when scraping in PHP; I explain the workarounds below).
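
To make the retry idea concrete, here is a minimal sketch of a retry helper with exponential backoff. The name withRetries, its parameters, and the delay values are assumptions made for illustration; they do not come from any library mentioned in this article. The callback returns null to signal "this attempt failed, try again".

```php
<?php

/**
 * Calls $attempt up to $maxAttempts times and returns its first non-null result.
 * $attempt receives the 1-based attempt number and returns null on failure.
 * $sleep is injectable so the helper can be exercised without real delays.
 */
function withRetries(callable $attempt, int $maxAttempts = 3, int $baseDelayMs = 500, ?callable $sleep = null)
{
    $sleep = $sleep ?? static function (int $ms): void {
        usleep($ms * 1000);
    };

    for ($i = 1; $i <= $maxAttempts; $i++) {
        $result = $attempt($i);
        if ($result !== null) {
            return $result;
        }
        if ($i < $maxAttempts) {
            // Exponential backoff: base delay, then 2x, then 4x, ...
            $sleep($baseDelayMs * (2 ** ($i - 1)));
        }
    }

    return null; // every attempt failed
}
```

In a scraper, the callback would typically perform one HTTP request (possibly through a different proxy on each attempt) and return null for a transport error or for a 429/502 response.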

cURL

The PHP cURL extension is still a very popular way to get responses from external websites. It requires zero external dependencies and works pretty fast, although the syntax is a bit verbose. A good list of cURL examples is available on the official PHP website: https://www.php.net/manual/en/curl.examples.php

Here is a basic example of a cURL GET request through a proxy:

$proxy = 'http://proxy-addr:port';
$proxyauth = 'username:pw';

// it is important to set real user agent
$agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36';

// create curl resource
$ch = curl_init();

// set url (include the scheme explicitly)
curl_setopt($ch, CURLOPT_URL, "https://example.com/");

// return the transfer as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxyauth);

//Tell cURL that it should only spend 10 seconds
//trying to connect to the URL in question.
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);

//A given cURL operation should only take
//30 seconds max. By default, cURL waits indefinitely!
curl_setopt($ch, CURLOPT_TIMEOUT, 30);

curl_setopt($ch, CURLOPT_USERAGENT, $agent);

// $output contains the response body, or false if the request failed
$output = curl_exec($ch);

if ($output === false) {
    echo 'cURL error: ' . curl_error($ch);
}

// close curl resource to free up system resources
curl_close($ch);

If you use cURL for scraping, it is a good idea to set these two options:

  • CURLOPT_CONNECTTIMEOUT: the maximum number of seconds that cURL should spend attempting to connect to a given URL.
  • CURLOPT_TIMEOUT: the maximum number of seconds that the whole cURL operation (the entire request) is allowed to take. cURL will wait indefinitely if this is not specified.
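
As a rough sketch of how these options can be bundled, the helper below collects them into an array for curl_setopt_array() and checks for transport errors after curl_exec(). The function names buildCurlOptions and curlGet are made up for this example; the timeout values mirror the ones used above.

```php
<?php

/** Builds the option array for a scraping GET request. */
function buildCurlOptions(string $url, string $userAgent, ?string $proxy = null, ?string $proxyAuth = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 10, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => 30, // seconds allowed for the whole request
        CURLOPT_USERAGENT      => $userAgent,
    ];
    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy;
    }
    if ($proxyAuth !== null) {
        $options[CURLOPT_PROXYUSERPWD] = $proxyAuth;
    }

    return $options;
}

/** Returns ['status' => int, 'body' => string]; throws on transport errors. */
function curlGet(array $options): array
{
    $ch = curl_init();
    curl_setopt_array($ch, $options);

    $body = curl_exec($ch);
    if ($body === false) {
        // Timeouts, DNS failures and refused proxy connections end up here
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    // The handle is released when $ch goes out of scope
    return [
        'status' => (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE),
        'body'   => $body,
    ];
}
```

Separating "build the options" from "run the request" keeps the configuration easy to inspect and reuse, for example when the same request is repeated through a different proxy.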

Guzzle

Guzzle is probably the second most popular way to retrieve HTTP responses in PHP. It is an external package which can be installed via Composer: composer require guzzlehttp/guzzle

Guzzle is an abstraction over HTTP transports: it uses cURL under the hood when the extension is available, and can fall back to PHP's stream wrapper otherwise. It generally helps you write less code, with a leaner syntax.

use GuzzleHttp\Client;

$proxyUrl = 'http://user:pw@proxy-host.com:8125';

// it is important to set real user agent
$agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36';

// Create a client with a base URI
$client = new Client(['base_uri' => 'https://foo.com/api/']);

// Send a request to https://foo.com/api/test
$response = $client->request('GET', 'test', [
    'connect_timeout' => 10,
    'timeout' => 30, // Guzzle will wait indefinitely if not specified
    'headers' => [
        'User-Agent' => $agent
    ],
    'proxy' => [
        'http'  => $proxyUrl, // Use this proxy with "http"
        'https' => $proxyUrl, // Use this proxy with "https"
        'no' => ['.mit.edu']  // Don't use a proxy with these hosts (do not list the target host here)
    ]
]);

echo $response->getStatusCode(); // HTTP status code
echo $response->getBody();       // raw response body

ScrapeNinja

I am the author of ScrapeNinja and of the ScrapeNinja PHP API Client (which is built on top of Guzzle). ScrapeNinja is a SaaS product with a free plan and a very minimalistic API surface. It is much more than a simple HTTP request-response tool: it provides solutions for the most common web scraping issues, such as retries, proxies, and real browser fingerprint emulation. Its killer feature is emulating the TLS fingerprint of a real browser, which is not possible with plain Guzzle or cURL. Without it, a lot of anti-scraping protections will mark you as a bot even if you set all headers, like User-Agent, correctly. Install ScrapeNinja with Composer:

composer require restyler/scrapeninja-api-php-client

use ScrapeNinja\Client;

$scraper = new Client([
        "rapidapi_key" => "YOUR-RAPID-API-KEY" // get your key on https://rapidapi.com/restyler/api/scrapeninja
    ]
);

$response = $scraper->scrape([
    "url" => "https://news.ycombinator.com/", // target website URL
    "geo" => "us", // Proxy geo. eu, br, de, fr, 4g-eu, us proxy locations are available. Default: "us"
    "headers" => ["Some-custom-header: header1-val", "Other-header: header2-val"], // Custom headers to pass to target website. User-agent header is not required, it is attached automatically.
    "method" => "GET" // HTTP method to use. Default: "GET". Allowed: "GET", "POST", "PUT". 
]);

print_r($response);

Much more is happening under the hood of this ScrapeNinja code than in the cURL and Guzzle examples:

  1. Retries are handled automatically. ScrapeNinja will make 3 attempts to retrieve the response.
  2. You don't really need to set the User-Agent header. ScrapeNinja sets it automatically, though you can always override it via the headers property.
  3. You don't need to worry about proxies: just set geo: "us" and you are already using a pool of rotating, high-speed proxies with US IP addresses!
  4. You get the major killer feature of ScrapeNinja: the Chrome TLS fingerprint. Read more about why this is important in the detailed post.
  5. You get flexibility: you can retry not just on certain HTTP response codes (via the statusNotExpected: [403, 503, 429] property), but also on custom text found in the response body. Just specify textNotExpected: ["some-captcha-error-page-text"] (make sure this text does not appear in a good response!).

All these features do not mean that ScrapeNinja is a big abstraction that stands in your way. It is a pretty low-level tool which specializes in only one part of web scraping: retrieving responses quickly and reliably.
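
To illustrate the idea behind the last point, here is a plain-PHP sketch of the decision "should this response be retried?". It only borrows the statusNotExpected and textNotExpected names from the list above. The function shouldRetry is hypothetical, and this is not how ScrapeNinja is implemented internally.

```php
<?php

/**
 * Returns true when a response looks bad and is worth retrying:
 * either its status code is in $statusNotExpected, or its body
 * contains one of the marker strings in $textNotExpected.
 */
function shouldRetry(int $status, string $body, array $statusNotExpected, array $textNotExpected): bool
{
    if (in_array($status, $statusNotExpected, true)) {
        return true;
    }

    foreach ($textNotExpected as $needle) {
        // Skip empty markers: an empty string would match every body
        if ($needle !== '' && strpos($body, $needle) !== false) {
            return true;
        }
    }

    return false;
}
```

The body check matters because many captcha and block pages are served with a 200 status code, so the status code alone cannot tell a good response from a bad one.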

Parsing the scraped response in PHP

After you get the response, you need to parse it. It is not easy to find a modern PHP HTML parsing library with a convenient syntax. Of course, you can use DOMDocument, which is a fast PHP extension, but in my opinion its UX is not perfect, since you can't easily use simple CSS selectors to pick elements from the HTML.

DOMDocument + DOMXPath

$html = '
    <div class="page-wrapper">
        <section class="page single-review" itemtype="http://schema.org/Review" itemscope="" itemprop="review">
            <article class="review clearfix">
                <div class="review-content">
                    <div class="review-text" itemprop="reviewBody">
                    Outstanding ... 
                    </div>
                </div>
            </article>
        </section>
    </div>
';

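$classname = 'review-text';

// DOMDocument may emit warnings for HTML5 tags such as <section> and <article>;
// collect them internally instead of printing them
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// exact match on the class attribute (an element with several classes will not match)
$results = $xpath->query("//*[@class='" . $classname . "']");

if ($results->length > 0) {
    $review = trim($results->item(0)->nodeValue);
    echo $review;
}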
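
The query above compares the whole class attribute, so it would miss an element like class="review-text highlighted". A common XPath workaround is to match a single class token instead, as in the sketch below. The helper name textsByClass is made up for this example, and it assumes that the class name contains no quote characters.

```php
<?php

/** Returns the trimmed text of every element whose class list contains $class. */
function textsByClass(string $html, string $class): array
{
    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Pad the class attribute with spaces so that "review-text" matches
    // as a whole token and not as a prefix of "review-text-extra"
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }

    return $texts;
}
```

This is roughly the verbosity that CSS-selector based libraries hide behind a simple .review-text selector.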

Symfony DOM Crawler

https://symfony.com/doc/current/components/dom_crawler.html

This package provides a more ergonomic and concise syntax for retrieving elements from the HTML DOM tree. Technically it is just syntactic sugar over DOMDocument and DOMXPath, so I don't see a big reason to use raw DOMDocument anymore. Note that the filter() method with CSS selectors also requires the symfony/css-selector package. Make sure to check the examples on the official website.

use Symfony\Component\DomCrawler\Crawler;

$html = '
    <div class="page-wrapper">
        <section class="page single-review" itemtype="http://schema.org/Review" itemscope="" itemprop="review">
            <article class="review clearfix">
                <div class="review-content">
                    <div class="review-text" itemprop="reviewBody">
                    Outstanding ... 
                    </div>
                </div>
            </article>
        </section>
    </div>
';

$crawler = new Crawler($html);

$reviews = $crawler
    ->filter('.review-text')
    ->reduce(function (Crawler $node, $i) {
        // keep only nodes at even positions (0, 2, 4, ...)
        return ($i % 2) == 0;
    });

// print the text of the first remaining node
echo $reviews->first()->text();

Higher level packages

These packages include both retrieval and parsing, and sometimes spider building functionality.

Goutte

Goutte is a higher-level wrapper around the BrowserKit, DomCrawler, and HttpClient Symfony components:

use Goutte\Client;

$client = new Client();

$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, ['login' => 'fabpot', 'password' => 'xxxxxx']);
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});

Do not be confused: while it has submit() and click() methods, this is not a full-fledged automated browser, just syntactic sugar around raw server responses.

Unfortunately, at the time of writing this article, Goutte had not been updated on GitHub for almost 6 months.

Scrapy alternative for PHP: roach-php

RoachPHP is a pretty young project, a "shameless clone" of Scrapy, a very popular Python scraping framework.

This is a high-level library which not only scrapes a particular page, but can also crawl an entire website as a "spider" (so technically it is both a crawler and a scraper framework).
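
To give a feel for what the "spider" part involves, here is a deliberately simplified plain-PHP sketch of its core step: collecting the links on a page that stay on the same host, so they can be queued for the next round of requests. This is not RoachPHP's API. The function sameHostLinks is hypothetical, and it ignores path-relative links (such as "page2.html") and non-default ports.

```php
<?php

/**
 * Returns the unique absolute links in $html that point to the same host as $baseUrl.
 * Only absolute, protocol-relative and root-relative hrefs are handled.
 */
function sameHostLinks(string $html, string $baseUrl): array
{
    $scheme = parse_url($baseUrl, PHP_URL_SCHEME);
    $host = parse_url($baseUrl, PHP_URL_HOST);

    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));

        if (strpos($href, '//') === 0) {
            $href = $scheme . ':' . $href;            // protocol-relative link
        } elseif (strpos($href, '/') === 0) {
            $href = $scheme . '://' . $host . $href;  // root-relative link
        }

        // Fragments, mailto: links and other hosts have no matching host
        if (parse_url($href, PHP_URL_HOST) === $host) {
            $links[$href] = true; // array keys de-duplicate the URLs
        }
    }

    return array_keys($links);
}
```

A real spider adds a queue of pending URLs, a set of already visited URLs, request throttling and item extraction on top of this, which is exactly the plumbing a framework provides out of the box.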