I recently started getting Cloudflare 1020 (403) errors when scraping a random e-commerce website. At first I thought the website didn't like my scraper's IP address, but switching to a clean residential proxy, and even to my home network, didn't fix the issue. Curiously, when the website was opened in Chrome, it loaded without any issues. I opened Chrome DevTools and did a "Copy as cURL" from the Network tab, exactly as I always do when debugging a scraping process.

The identical request, replayed via curl from my home network, triggered the 403 response from Cloudflare. So, something interesting was going on!

A quick Google search turned up a number of reports of pretty much the same situation: the request works fine in a real browser but fails when launched from Python, Node.js, or curl. My curiosity was over the top. After another hour of googling, the picture started to gain the necessary details: apparently, Cloudflare has started deploying TLS/SSL handshake analysis as a primary tool in its anti-scraping website protection.

https://blog.cloudflare.com/monsters-in-the-middleboxes/

This write-up is not exactly about preventing scraping, but it sheds some light on the fact that Cloudflare is now gathering TLS fingerprint statistics, and these stats might be (and are) used to fight scrapers.

What is TLS fingerprinting?

First of all, in case you don't know what TLS means, here is a very simple and quick explanation. TLS (Transport Layer Security) is the technology used under the hood of every HTTPS connection from a client (a browser, or curl) to a website.

Back in the days of the plain HTTP protocol, there was no such layer. Now it's hard to find a website which uses an http:// address by default - all major websites use https://, which, again, relies on the TLS protocol (and SSL, before it was deprecated). This is great news for everyone, because it makes a lot of man-in-the-middle attacks very hard to pull off. But it also provides interesting ways to retrieve a unique client fingerprint. While we don't know for sure which method Cloudflare uses to compute the TLS fingerprint, the most popular method is JA3.

JA3 - A method for profiling SSL/TLS Clients    

Copy-pasting the description from the JA3 GitHub page:

https://github.com/salesforce/ja3/

To initiate a SSL(TLS) session, a client will send a SSL Client Hello packet following the TCP 3-way handshake. This packet and the way in which it is generated is dependant on packages and methods used when building the client application. The server, if accepting SSL connections, will respond with a SSL Server Hello packet that is formulated based on server-side libraries and configurations as well as details in the Client Hello. Because SSL negotiations are transmitted in the clear, it’s possible to fingerprint and identify client applications using the details in the SSL Client Hello packet.

Confirming TLS protection

To be 100% sure this is some kind of fingerprint-based protection, I set up a simple Puppeteer script which opens the website and dumps its HTML:

const puppeteer = require('puppeteer');

(async function main() {
    try {
        const browser = await puppeteer.launch();
        const [page] = await browser.pages();

        await page.goto('https://example.org/', { waitUntil: 'networkidle0' });
        const bodyHTML = await page.evaluate(() => document.body.innerHTML);

        console.log(bodyHTML);

        await browser.close();
    } catch (err) {
        console.error(err);
    }
})();

Inspecting the bodyHTML constant... still error code 1020! Not so easy. Let's make sure our Puppeteer sets a proper user agent:

Puppeteer.js in stealth mode

Of course, we could just set the user agent manually in the Puppeteer initialization code, but there is a better and more reliable way:

npm i puppeteer-extra puppeteer-extra-plugin-stealth

Let's modify the code so it uses the stealth plugin:

// puppeteer-extra is a drop-in replacement for puppeteer,
// it augments the installed puppeteer with plugin functionality
const puppeteer = require('puppeteer-extra');

// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
// note the semicolon here: without it, automatic semicolon insertion
// would glue this line to the IIFE below and throw a TypeError
puppeteer.use(StealthPlugin());

(async function main() {
    try {
        const browser = await puppeteer.launch();
        const [page] = await browser.pages();

        await page.goto('https://example.org/', { waitUntil: 'networkidle0' });
        const bodyHTML = await page.evaluate(() => document.body.innerHTML);

        console.log(bodyHTML);

        await browser.close();
    } catch (err) {
        console.error(err);
    }
})();

Launching... now this gives us a 200 response, perfect! The page opens fine and Puppeteer does not trigger the Cloudflare 403 error page.

Bypassing TLS fingerprint protection. Rapidly.

Using Puppeteer is a perfectly viable way to bypass the protection, but here is a joke for you: if you have a problem and you try to solve it with Puppeteer, now you have two problems. (Okay, originally this joke was about regular expressions, but I think Puppeteer fits even better.) Puppeteer, while an incredible piece of software, is a huge mammoth - insanely resource-hungry, slow, and error-prone. For my use case, Puppeteer was big overkill, and I didn't want to deal with RAM management issues while scraping. So I needed to find an alternative. An alternative was found: build a curl-like utility on top of BoringSSL ( https://boringssl.googlesource.com/boringssl/ ), the TLS library used by the Chrome and Chromium projects. If we use the same library as Chrome uses under the hood, it's highly likely our TLS fingerprint will be the same, right?

It was not an easy task, since I have zero knowledge of C/C++, and the BoringSSL and Chromium codebases are massive. But that was also the reason why it was super exciting to finally get a compiled utility which can be run pretty much like curl:

curlninja https://example.com  -H "X-Random-Header: header" -o output.html

And when it turned out to work and didn't trigger the Cloudflare protection, while being 3-5x faster than Puppeteer (and requiring, probably, an order of magnitude less RAM and CPU), I was ecstatic. Digging deep, exploring new areas, and getting results is what makes me happy as a developer.

Sharing the solution with the world: ScrapeNinja

Compiling this themselves is a no-go for most web devs, so the obvious way to let others try this utility was to wrap it in a Node.js API server, which is now available via RapidAPI as a cloud SaaS solution:

https://rapidapi.com/restyler/api/scrapeninja

ScrapeNinja was just released and will probably evolve a lot during the upcoming months, but it is already a finished MVP (Minimum Viable Product) and something I am pretty proud of:

  1. It has proxy management under the hood and uses non-datacenter, US-based IP address ranges (I will probably extend the pool with dynamic residential proxies and add more countries soon).
  2. It is pretty fast - around 0.8-2 seconds for an average HTML page.
  3. It allows passing custom headers (and cookies) to the target website.
  4. It has a simple, yet nice, logo (which I created in Figma):

For now, only the GET HTTP method is supported by ScrapeNinja, but I will probably add POST soon. Try it - I will appreciate your feedback!

Cloudflare has more than just one technique to fight scrapers and bots

It is important to understand that ScrapeNinja currently tries to bypass only one Cloudflare protection layer (well, two, if we count its use of clean residential proxies). Cloudflare also has JS challenge screens under its belt - this type of protection may show an hCaptcha, or it may just require some JS computations before you can access the target website. This kind of protection is much slower and more annoying for end users, so usually only heavily DDoSed websites activate it. It shows text similar to this:

Checking your browser before accessing example.com.

This process is automatic. Your browser will redirect to your requested content shortly.

Please allow up to 5 seconds...

Subsequent requests to the same website already carry a special cookie attached by the JS code executed in the end user's browser, so they are pretty fast.

For this type of protection, various GitHub solutions may be used, for example:

https://github.com/VeNoMouS/cloudscraper and https://github.com/RyuzakiH/CloudflareSolverRe

Scraping is indeed a cat-and-mouse game, and website protections evolve all the time. It's a pretty exciting game to be in, as well!

Don't forget to be ethical while scraping, and never overload the target website with a large number of requests.