How to bypass CloudFlare 403 (code:1020) errors [UPDATED 2023]

I recently started getting Cloudflare 1020 (403) errors when scraping a random e-commerce website. At first, I thought the website didn't like my scraper's IP address, but switching to a clean residential proxy and even to my home network didn't fix the issue. Strangely, the website opened in Chrome without any problems. So I opened Chrome DevTools and performed the "Copy as cURL" operation from the Network tab, exactly as I always do when debugging a scraping process.

The identical request, sent via curl from my home network, triggered the 403 response from Cloudflare. So, something interesting was going on!
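For illustration, the copied request looked roughly like this (the URL and headers here are placeholders, not the real website):

curl 'https://example.com/some-product' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.0.0 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
  -H 'Accept-Language: en-US,en;q=0.9' \
  --compressed
# => Cloudflare error page with code 1020, even though the exact same
#    URL and headers load fine in Chrome itself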

Quick googling helped me discover a number of similar reports: the request works fine in a real browser but fails when launched from Python/Node.js/curl. My curiosity was over the top. After another hour of googling, the picture gained the necessary details: apparently, Cloudflare has started to use TLS/SSL handshake analysis as a primary tool for its anti-scraping protection.

https://blog.cloudflare.com/monsters-in-the-middleboxes/

This writeup is not exactly on the topic of scraping prevention, but it sheds some light on the fact that Cloudflare is now gathering TLS fingerprint statistics, and these stats might be (and are) used to fight scrapers.

What is TLS fingerprinting?

First of all, just in case you don't know what TLS means, here is a very quick and simple explanation. TLS (Transport Layer Security) is the technology used under the hood for every https connection from a client (a browser, or curl) to a website.

Back in the days of the plain http protocol, there was no such layer. Now it's hard to find a website which uses an http:// address by default - all major websites use https://, which, again, relies on the TLS protocol (and on SSL, before it was deprecated). This is great news for everyone, because it makes many man-in-the-middle attacks very hard to pull off. But it also provides interesting ways to retrieve a unique client fingerprint. While we don't know for sure which method Cloudflare uses to compute the TLS fingerprint, the most popular method is JA3.

JA3 - A method for profiling SSL/TLS Clients    

Copy&pasting the description from the JA3 GitHub page:

https://github.com/salesforce/ja3/

To initiate a SSL(TLS) session, a client will send a SSL Client Hello packet following the TCP 3-way handshake. This packet and the way in which it is generated is dependant on packages and methods used when building the client application. The server, if accepting SSL connections, will respond with a SSL Server Hello packet that is formulated based on server-side libraries and configurations as well as details in the Client Hello. Because SSL negotiations are transmitted in the clear, it’s possible to fingerprint and identify client applications using the details in the SSL Client Hello packet.
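JA3 gathers the decimal values of the bytes for five Client Hello fields - SSLVersion, Ciphers, Extensions, EllipticCurves, and EllipticCurvePointFormats - concatenates them with delimiters, and MD5-hashes the result into an easily shareable 32-character fingerprint. Copying the example from the same README:

769,47-53-5-10-49161-49162-49171-49172-50-56-19-4,0-10-11,23-24-25,0

which hashes to the JA3 fingerprint:

ada70206e40642a3e4461f35503241d5

Since curl (built against OpenSSL) and Chrome (built against BoringSSL) send noticeably different Client Hello packets, they produce different JA3 hashes - which is exactly what a server can key on.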

Confirming TLS protection

To be 100% sure this is some kind of fingerprint-based protection, I've set up a simple Puppeteer.js script which opens the website and dumps its HTML:

const puppeteer = require('puppeteer');

(async function main() {
    try {
        const browser = await puppeteer.launch();
        const [page] = await browser.pages();

        await page.goto('https://example.org/', { waitUntil: 'networkidle0' });
        const bodyHTML = await page.evaluate(() => document.body.innerHTML);

        console.log(bodyHTML);

        await browser.close();
    } catch (err) {
        console.error(err);
    }
})();

Inspecting the bodyHTML constant... still code:1020! Not so easy... let's make sure our Puppeteer sets a proper user agent:

Puppeteer.js in stealth mode

Of course, we can just set the user-agent manually in the Puppeteer initialization code - something like this one-liner (the UA string is only an example):
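// inside the async main(), before page.goto() - the UA string is just an example
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.0.0 Safari/537.36');

But there is a better and more reliable way - the stealth plugin, which patches many known automation leaks beyond just the user agent: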

npm i puppeteer-extra puppeteer-extra-plugin-stealth

Let's modify the code so it uses the stealth plugin:

// puppeteer-extra is a drop-in replacement for puppeteer,
// it augments the installed puppeteer with plugin functionality
const puppeteer = require('puppeteer-extra');

// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
// note the semicolon below: without it, the IIFE that follows would be
// parsed as a function call on the return value of puppeteer.use()
puppeteer.use(StealthPlugin());

(async function main() {
    try {
        const browser = await puppeteer.launch();
        const [page] = await browser.pages();

        await page.goto('https://example.org/', { waitUntil: 'networkidle0' });
        const bodyHTML = await page.evaluate(() => document.body.innerHTML);

        console.log(bodyHTML);

        await browser.close();
    } catch (err) {
        console.error(err);
    }
})();

Launching... now this gives us a 200 response, perfect! The page opens fine and Puppeteer does not trigger the Cloudflare 403 error page.

Bypassing TLS fingerprint protection. Rapidly.

Using Puppeteer.js is a perfectly viable way to bypass the protection, but here is a joke for you: if you have a problem, and you try to solve it with Puppeteer, you now have two problems. (Okay, originally this joke was about regular expressions, but I think Puppeteer fits here even better.) Puppeteer, while being an incredible piece of software, is a huge mammoth: insanely resource hungry, slow, and error prone. For my use case, Puppeteer was overkill, and I didn't want to deal with RAM management issues while scraping. So, I needed to find an alternative. An alternative was found: I had to use BoringSSL ( https://boringssl.googlesource.com/boringssl/ ), the TLS library used by the Chrome and Chromium projects, to build a curl-like utility. If we use the same TLS library Chrome uses under the hood, it's highly likely our TLS fingerprint will be the same, right?

It was not an easy task, since I had zero knowledge of C/C++, and the BoringSSL and Chromium codebases are massive. But this was also the reason why it was super exciting to finally get a compiled utility which can be run pretty much like curl:

curlninja https://example.com  -H "X-Random-Header: header" -o output.html

And when it turned out to work without triggering the Cloudflare protection, while being 3x-5x faster than Puppeteer (and requiring, probably, an order of magnitude less RAM and CPU), I was ecstatic. Digging deep, exploring new areas, and getting results is what makes me happy as a developer.
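By the way, a handy way to verify that the fingerprint really matches Chrome's is to hit a TLS fingerprint echo service with both clients and compare what it reports - for example, assuming browserleaks' TLS endpoint, which echoes back the JA3 hash it observes:

curlninja https://tls.browserleaks.com/json -o fingerprint.json
# open the same URL in Chrome and compare the reported JA3 hash values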

Sharing the solution with the world: ScrapeNinja

Compiling this from scratch is a no-go for most web devs, so the obvious way to let others try this utility was to wrap it into a Node.js API server, which is now available via RapidAPI as a cloud SaaS solution:

https://rapidapi.com/restyler/api/scrapeninja

ScrapeNinja was just released and will probably evolve a lot during the upcoming months, but it is already a finished MVP (Minimum Viable Product) and something I am pretty proud of:

  1. It has proxy management under the hood and uses non-datacenter, US-based IP address ranges (I will probably extend the pool with dynamic residential proxies and add more countries soon).
  2. It is pretty fast - around 0.8-2 sec for an average HTML page.
  3. It allows passing custom headers (and cookies) to the target website.
  4. It has a simple, yet nice, logo (which I created in Figma).

For now, only the GET HTTP method is supported by ScrapeNinja, but I will probably add POST soon. Try it - I will appreciate your feedback!
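Here is a minimal sketch of calling ScrapeNinja from Node.js (18+, with built-in fetch). The scrapeninja.p.rapidapi.com host and the exact body schema are my assumptions based on standard RapidAPI conventions - check the RapidAPI playground for the authoritative request format:

(async () => {
    // the API call itself is a POST; "url" is the target page,
    // which ScrapeNinja retrieves with a GET request
    const res = await fetch('https://scrapeninja.p.rapidapi.com/scrape', {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'X-RapidAPI-Key': process.env.RAPIDAPI_KEY, // your RapidAPI key
            'X-RapidAPI-Host': 'scrapeninja.p.rapidapi.com'
        },
        body: JSON.stringify({
            url: 'https://example.com/',
            headers: ['X-Random-Header: header'] // optional custom headers
        })
    });
    console.log(await res.json());
})();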

(UPD: POST & PUT methods have been added to ScrapeNinja!)

Cloudflare has more than just one technique to fight scrapers and bots

It is important to understand that ScrapeNinja currently tries to bypass only one Cloudflare protection layer (well, two, if we count its clean residential proxies). Cloudflare also has JS challenge screens under its belt: this type of protection may show an hCaptcha, or it may just require some JS computations before you can access the target website. This kind of protection is much slower and more annoying for end users, so usually only heavily DDoSed websites activate it. It shows text similar to this:

Checking your browser before accessing example.com.

This process is automatic. Your browser will redirect to your requested content shortly.

Please allow up to 5 seconds...

Subsequent requests to the same website carry a special cookie, attached by the JS code executed in the end user's browser, so they are pretty fast.
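That cookie is Cloudflare's clearance cookie, cf_clearance. In principle, once a real browser has solved the challenge, the cookie can be reused in plain requests, roughly like this (the values are placeholders; note that the clearance is typically bound to the IP address and user-agent that solved the challenge, so both must match):

curl 'https://example.com/' \
  -H 'User-Agent: <the exact UA of the browser that passed the challenge>' \
  -H 'Cookie: cf_clearance=<value copied from the browser devtools>'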

For this type of protection, various GitHub solutions may be used, for example:

https://github.com/VeNoMouS/cloudscraper and https://github.com/RyuzakiH/CloudflareSolverRe

Scraping is indeed a cat-and-mouse game, and website protections evolve all the time. It's a pretty exciting game to be in, as well!

UPD: ScrapeNinja can now render JavaScript websites by launching a real browser via API, read more about how it works in my writeup: https://pixeljets.com/blog/puppeteer-api-web-scraping/ - technically, this means that even the JavaScript waiting screen of Cloudflare can be bypassed by ScrapeNinja (just specify a proper waitForSelector to wait for the target website content), as in the sketch below.
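A hypothetical request body for the JS rendering call might look like this - the selector is made up for illustration, see the writeup above for the real parameters:

{
    url: 'https://example.com/products',
    waitForSelector: '.product-list' // wait until the content we need is rendered
}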

Disclaimer

  1. ScrapeNinja does not have any special mechanisms to bypass Cloudflare or abuse websites using it; it is basically just a 5x faster Puppeteer (which is already widely available) with disabled JS evaluation (the `/scrape` endpoint).
  2. ScrapeNinja might be useful for scraping any website worldwide; it is a simple tool - use it with caution, and respect applicable laws and target website rules.
  3. Don't forget to be ethical while scraping, and never overload the target website with a large number of requests.
  4. Any accounts attempting to use ScrapeNinja for illegal purposes or for website abuse will be terminated.
Quick "how to" video about ScrapeNinja

UPD April 2022: ScrapeNinja major update

  • More rotating proxy geos (Germany, France, Brazil, 4G residential)
  • Better timeout handling
  • Smart retries based on status code and custom text in the response body (see the sketch after this list)
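A hypothetical retry configuration fragment - the parameter names below are my guesses at the schema, check the ScrapeNinja docs for the real ones:

{
    url: 'https://example.com/',
    retryNum: 3,                           // hypothetical: retry up to 3 times
    statusNotExpected: [403, 503],         // hypothetical: retry on these codes
    textNotExpected: ['error code: 1020']  // hypothetical: retry if body contains this
}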

Read more in the April writeup: Never handle retries and proxies in your code again

UPD September 2022: ScrapeNinja major update

  • I have finally implemented a real Chrome API in the ScrapeNinja /scrape-js endpoint: you can now choose between the basic, ultra-fast /scrape endpoint for simple tasks and the new endpoint for complex tasks requiring full JavaScript rendering
  • ScrapeNinja can now extract JSON data from the HTML sources of scraped output, thanks to the Extractors feature and the ScrapeNinja Cheerio Online Sandbox, which allows you to quickly write and test custom JavaScript extraction code online, in the browser (think the regex101 editor, but for Cheerio) - see the sketch after this list
  • Many smaller fixes and improvements!
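For a taste of what an extractor looks like, here is a hypothetical one. I'm assuming the sandbox convention of a function that receives the raw HTML and a cheerio instance and returns plain JSON; the selectors are made up:

// hypothetical extractor: turn a scraped product page into JSON
function extract(input, cheerio) {
    // load the scraped HTML into cheerio (a server-side jQuery-like API)
    const $ = cheerio.load(input);
    return {
        title: $('h1').first().text().trim(),
        prices: $('.price').map((i, el) => $(el).text().trim()).get()
    };
}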

UPD March 2023: ScrapeNinja was launched on Product Hunt!

ScrapeNinja got featured on Product Hunt and received the "3rd Product of the Day" award, which is a big honor for me, especially in times of the AI hype cycle, when it is so hard to get upvotes on a non-GPT product. Thank you for your support!

I am available for custom development & consulting projects.

Drop me a message via Telegram.