scraping

How to set proxy in node-fetch

Executing http(s) requests via a proxy can be helpful in a lot of cases: it makes your HTTP request look like it was executed from a different country or location. Setting a proxy in the node-fetch Node.js package is not as simple as in Axios (where we can set a proxy by passing a plain JS object with options); in node-fetch we need to pass an Agent with the proxy set up, so it is a bit more manual work. But this is also a good thing, because we can use the latest and greatest proxy package from npm…
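
A minimal sketch of the idea, assuming node-fetch v2 (CommonJS) and the https-proxy-agent package with its newer named export; the proxy URL and target URL are placeholders:

// Sketch: routing a node-fetch request through a proxy via an Agent.
// Assumes `npm install node-fetch@2 https-proxy-agent`; proxy credentials are placeholders.
const fetch = require('node-fetch');
const { HttpsProxyAgent } = require('https-proxy-agent');

const agent = new HttpsProxyAgent('http://user:password@proxy.example.com:8080');

(async () => {
  // node-fetch accepts an `agent` option, so the request is tunneled through the proxy.
  const res = await fetch('https://httpbin.org/ip', { agent });
  console.log(await res.json()); // should report the proxy's IP, not yours
})();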

2 min read

Web scraping in Javascript: node-fetch vs axios vs got vs superagent

There are a number of ways to perform web requests in Node.js: node-fetch, axios, got, superagent. Node.js can also perform HTTP requests without additional packages. While I don't ever use this approach because of its poor developer ergonomics (using EventEmitter to collect the response data is just too verbose for me), Node.js is perfectly capable of sending HTTP requests without any libraries from npm! const https = require('https'); https.get('https://example.com/some-page', (resp) => { let…
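
For reference, the built-in pattern the excerpt starts to show looks roughly like this (a sketch using only the core https module; the URL is the same example placeholder):

// Sketch: a GET request with Node's built-in https module, no npm packages.
const https = require('https');

https.get('https://example.com/some-page', (resp) => {
  let data = '';
  // The response body arrives in chunks via EventEmitter events.
  resp.on('data', (chunk) => { data += chunk; });
  resp.on('end', () => {
    console.log(data); // full response body
  });
}).on('error', (err) => {
  console.error('Request failed:', err.message);
});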

5 min read

Puppeteer: click an element and get raw JSON from XHR/AJAX response

This has lately become a pretty popular question when scraping with Puppeteer: let's say you want to interact with the page (e.g. click a button) and retrieve the raw AJAX response (usually JSON). Why would you want to do this? It is actually an interesting "hybrid" approach to extracting data from a web page: while we interact with the page like a real browser, we still do not mess around with the usual DOM traversal process to extract the data, and we grab the raw JSON server response instead…
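
A minimal sketch of this approach, assuming Puppeteer is installed; the page URL, the #load-more selector, and the /api/ substring used to match the XHR are all placeholders:

// Sketch: click an element and capture the raw JSON of the XHR it triggers.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Start waiting for the XHR response *before* clicking, so it isn't missed.
  const [response] = await Promise.all([
    page.waitForResponse((res) => res.url().includes('/api/')),
    page.click('#load-more'), // placeholder selector for the button
  ]);

  const data = await response.json(); // raw JSON body of the AJAX response
  console.log(data);

  await browser.close();
})();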

5 min read

Puppeteer API service for web scraping

Okay, let's admit it: web scraping via Puppeteer and Playwright is the most versatile and flexible way of web scraping nowadays. Unfortunately, it's also the most cumbersome and time-consuming way of scraping, and sometimes it feels a little bit like voodoo magic. This is a post about my long & painful journey of taming a real Chrome browser, controlled programmatically via Puppeteer, for my web scraping needs. First and foremost: do not use a real Chrome browser for scraping unless it is a…

10 min read

How to bypass CloudFlare 403 (code:1020) errors [UPDATED 2023]

I've recently started getting Cloudflare 1020 (403) errors when scraping some random e-commerce website. At first, I thought the website didn't like my scraper's IP address, but switching to a clean residential proxy and even to my home network didn't fix the issue. Strangely, when the website was opened in Chrome, it loaded without any problems. I opened Chrome DevTools and did the "Copy as cURL" operation from the Network tab, exactly how I always do it when debugging scraping…

7 min read