Web scraping

Web scraping is a topic in which I am deeply interested. I have completed numerous successful web scraping projects and have launched several products in this field. Most of these projects were developed using Node.js, Puppeteer, cURL, and Playwright.

How to download PDF in Playwright

In the ever-evolving world of web scraping, I often come across hurdles that require creative solutions, quick code workarounds, and hacks - and oh boy, this is especially true when I am working with programmatically driven browsers, which I happen to do a lot lately. Today, I'd like to share a challenge I faced while trying to download PDF files using Playwright, and how I managed to overcome it. The Unexpected Twist with Chromium and Playwright: initially, after quickly browsing Playw…

3 min read
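
The teaser cuts off above, so here is only a minimal sketch of Playwright's standard download flow, not necessarily the workaround the article arrives at (in Chromium, a PDF link may open the built-in viewer instead of triggering a download, which is likely the twist the title hints at). The URL, link selector, and file name below are hypothetical; the relevant pieces are `acceptDownloads` on the context, `page.waitForEvent('download')`, and `download.saveAs()`.

```js
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext({ acceptDownloads: true });
  const page = await context.newPage();
  await page.goto('https://example.com/some-page'); // hypothetical URL

  // Start waiting for the download *before* clicking the link that triggers it,
  // so the event is not missed.
  const [download] = await Promise.all([
    page.waitForEvent('download'),
    page.click('a.pdf-link'), // hypothetical selector
  ]);
  await download.saveAs('./report.pdf'); // hypothetical file name

  await browser.close();
})();
```
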
Choosing a proxy for web scraping

Once you're familiar with basic web scraping tools like Scrapy, and you've scraped your first 1-2 websites, you'll probably get your first ban because your IP address has made too many requests (what "too many" means really depends on the site: for one site it's just 3 requests per hour, for another it's 100 requests in a 5-minute window). It's important to make sure the ban is actually related to the IP address from which you're sending your requests: to check that it's not a coo…

10 min read
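
The excerpt is truncated, but one quick way to test whether a ban is tied to your IP (my own sketch, not necessarily the article's method) is to send the same request with identical headers both directly and through a different IP, i.e. a proxy, and compare status codes. The URL and proxy credentials below are hypothetical, and node-fetch v2 plus the https-proxy-agent package from npm are assumed.

```js
const fetch = require('node-fetch');
const { HttpsProxyAgent } = require('https-proxy-agent');

async function compareBanStatus(url, proxyUrl) {
  const headers = { 'User-Agent': 'Mozilla/5.0' }; // keep everything else identical
  const direct = await fetch(url, { headers });
  const proxied = await fetch(url, { headers, agent: new HttpsProxyAgent(proxyUrl) });
  console.log(`direct: ${direct.status}, via proxy: ${proxied.status}`);
  // 403/429 directly but 200 through the proxy points at an IP-level ban;
  // the same error through both suggests something else (cookies, fingerprint).
}

compareBanStatus('https://example.com/some-page', 'http://user:pass@proxy.example.com:8080');
```
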

How to set proxy in Puppeteer: 3 ways

Puppeteer is an incredibly useful tool for automating web browsers. It allows you to run headless (or non-headless) Chrome instances, automatically interacting with websites and pages in ways that would normally require manual input from a user or other scripts. In a lot of cases (particularly in web scraping tasks), HTTP requests need to look like they originate from a different IP or network than the server running Puppeteer, and this is where proxies come into play. In this blog po…

5 min read
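
The post itself is cut off, so as a taste of one of the ways, here is a minimal sketch of the launch-argument approach: pass `--proxy-server` to Chromium at launch and, if the proxy needs credentials, supply them with `page.authenticate()`. The proxy address is hypothetical.

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // Route all browser traffic through the proxy (hypothetical address).
    args: ['--proxy-server=http://proxy.example.com:8080'],
  });
  const page = await browser.newPage();
  // If the proxy requires credentials, Puppeteer can supply them per page.
  await page.authenticate({ username: 'user', password: 'pass' });
  await page.goto('https://example.com/some-page');
  console.log(await page.title());
  await browser.close();
})();
```
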

How to set proxy in node-fetch

Executing an HTTP(S) request via a proxy can be helpful in a lot of cases: it makes your request look like it was executed from a different country or location. Setting a proxy in the node-fetch Node.js package is not as simple as in Axios (where we can set a proxy by passing a simple JS object with options); in node-fetch we need to pass an Agent with the proxy set up, so it is a bit more manual work. But this is also a good thing, because we can use the latest and greatest proxy package from npm fo…

2 min read
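
A minimal sketch of that Agent-based setup, assuming the https-proxy-agent package from npm and node-fetch v2 (which supports `require`); the proxy URL is a placeholder.

```js
const fetch = require('node-fetch');
const { HttpsProxyAgent } = require('https-proxy-agent');

// The agent, not a plain options object, is what carries the proxy config.
const agent = new HttpsProxyAgent('http://user:pass@proxy.example.com:8080');

fetch('https://example.com/some-page', { agent })
  .then((res) => res.text())
  .then((body) => console.log(body.slice(0, 200)));
```
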

Web scraping in JavaScript: node-fetch vs axios vs got vs superagent

There are a number of ways to perform web requests in Node.js: node-fetch, axios, got, superagent. Node.js can also perform HTTP requests without additional packages. While I don't ever use this approach because of its poor developer ergonomics (using EventEmitter to collect the response data is just too verbose for me), Node.js is perfectly capable of sending HTTP requests without any libraries from npm! const https = require('https'); https.get('https://example.com/some-page', (resp) => { let…

5 min read
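
The snippet in the teaser is cut off; the usual shape of this zero-dependency approach is to collect `data` events from the response and assemble the body by hand, roughly like this:

```js
const https = require('https');

https.get('https://example.com/some-page', (resp) => {
  let data = '';
  resp.on('data', (chunk) => { data += chunk; }); // body arrives in chunks
  resp.on('end', () => console.log(data));        // full body is ready here
}).on('error', (err) => console.error(err));
```
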