Web scraping

Web scraping is a topic in which I am deeply interested. I have completed numerous successful web scraping projects and have launched several products in this field. Most of these projects were developed using Node.js, Puppeteer, cURL, and Playwright.

Puppeteer API service for web scraping

Okay, let's admit it: web scraping via Puppeteer and Playwright is the most versatile and flexible approach to scraping available today. Unfortunately, it is also the most cumbersome and time-consuming one, and sometimes it feels a little bit like voodoo magic. This is a post about my long and painful journey of taming a real Chrome browser, controlled programmatically via Puppeteer, for my web scraping needs. First and foremost: do not use a real Chrome browser for scraping unless it is a

10 min read
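For context on the approach this post describes, here is a minimal, hedged sketch of driving a real (headless) Chrome instance through Puppeteer and reading the rendered HTML. The target URL and launch options are placeholders, not the post's exact setup:

```js
// Minimal Puppeteer sketch: launch headless Chrome, load a page, read the rendered HTML.
// The URL below is a placeholder.
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// Wait until network activity settles so JS-rendered content has a chance to load.
await page.goto('https://example.com', { waitUntil: 'networkidle2' });

const html = await page.content();
console.log(`Got ${html.length} characters of rendered HTML`);

await browser.close();
```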

How to do web scraping in PHP

Web scraping is a big and hot topic right now, and PHP is a pretty fast language that is convenient for rapid prototyping and wildly popular among web developers. I have fairly extensive experience building complex scrapers in Node.js, but before that I spent many years actively building big projects powered by PHP (a lot of these projects are still alive, have proved to work well over the long term, and are still evolving). Hopefully, this simple tutorial will be useful for you! Contents: * JS and non

7 min read

ScrapeNinja: never handle retries and proxies in your code again

I am glad to announce that the ScrapeNinja scraping solution has just received a major update and got new features: Retries. Retries are a must-have for every scraping project. Proxies fail to process your request, the target website shows captchas, and all other bad things happen every time you try to get an HTTP response. ScrapeNinja is smart enough to detect most failed responses and retry via another proxy until it gets a good response (or it fails, when the number of retries exceeds retryN

2 min read
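To illustrate the idea behind the retries feature described above (a generic sketch only, not ScrapeNinja's actual API or internals): retry the request through a different proxy whenever a response looks bad, and give up after a configured number of attempts. The proxy list, the "bad response" heuristics, and the helper name are assumptions for illustration; undici provides the proxy-aware fetch here.

```js
// Generic retry-with-proxy-rotation sketch (illustrative only, not ScrapeNinja's code).
import { fetch, ProxyAgent } from 'undici';

// Placeholder proxies; a real list would come from your proxy provider.
const proxies = [
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
];

async function fetchWithRetries(url, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const proxy = proxies[attempt % proxies.length];
    try {
      const res = await fetch(url, { dispatcher: new ProxyAgent(proxy) });
      const body = await res.text();
      // Naive "bad response" checks: non-2xx status or a captcha marker in the body.
      if (res.ok && !body.toLowerCase().includes('captcha')) return body;
    } catch (err) {
      // Proxy or network error: fall through and retry with the next proxy.
    }
  }
  throw new Error(`No good response after ${maxRetries + 1} attempts`);
}

console.log((await fetchWithRetries('https://example.com')).slice(0, 200));
```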

Simple proxy checker script via CURL

While working on the ScrapeNinja scraping solution, I often need to verify whether a particular proxy is alive and performing well. Since I don't want to use various online services, especially for private proxies with user & password authentication, I have written a simple bash script which is much more concise than typing all the cURL commands into the terminal manually: #!/bin/bash # download, do chmod +x and copy to /usr/local/bin via ln -s /downloaded-dir/pcheck.sh /usr/local/bin/pcheck # then

2 min read
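The post's script is bash wrapped around cURL; purely as an illustration of the same check in the Node.js stack used elsewhere on this blog, here is a hedged sketch (the proxy URL is a placeholder, and the undici-based approach is my assumption, not what the script uses):

```js
// Rough Node.js sketch of a proxy liveness/latency check (the post itself uses bash + cURL).
import { fetch, ProxyAgent } from 'undici';

// Placeholder proxy with user:pass authentication, as in the use case described above.
const proxyUrl = 'http://user:pass@1.2.3.4:8080';

const started = Date.now();
try {
  const res = await fetch('https://api.ipify.org', {
    dispatcher: new ProxyAgent(proxyUrl),
    signal: AbortSignal.timeout(10_000), // treat a very slow proxy as dead
  });
  const exitIp = await res.text();
  console.log(`status ${res.status}, exit IP ${exitIp}, ${Date.now() - started} ms`);
} catch (err) {
  console.error('proxy check failed:', err.message);
}
```
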
How to bypass CloudFlare 403 (code:1020) errors [UPDATED 2023]

I recently started getting Cloudflare 1020 (403) errors when scraping some random e-commerce website. At first, I thought the website didn't like my scraper's IP address, but switching to a clean residential proxy and even to my home network didn't fix the issue. Strangely, the website opened in Chrome without any problems. I opened Chrome DevTools and did the "Copy as cURL" operation from the Network tab, exactly how I always do it when debugging the scraping

7 min read