scrapeninja

Choosing a proxy for web scraping

Once you're familiar with basic web scraping tools like Scrapy, and you've scraped your first one or two websites, you'll probably get your first ban because your IP address has made too many requests (what "too many" means really depends on the site: for one site it's just 3 requests per hour, for another it's 100 requests in a 5-minute window). It's important to make sure the ban is actually tied to the IP address you're sending requests from: check that it's not a cookie…
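A quick way to run that check is to send the same request directly and through a proxy, then compare the responses. A minimal sketch in Node.js using the undici package; the proxy credentials and target URL are hypothetical placeholders:

```ts
// Sketch: compare a direct request with one routed through a proxy to see
// whether a ban is IP-based. Proxy and target URLs are placeholders.
import { fetch, ProxyAgent } from "undici";

const target = "https://example.com/page";
const proxy = new ProxyAgent("http://user:pass@proxy.example.com:8080");

const direct = await fetch(target);
const viaProxy = await fetch(target, { dispatcher: proxy });

// If the direct request is blocked (403/429) but the proxied one succeeds,
// the ban is most likely IP-based rather than cookie- or header-based.
console.log("direct:", direct.status, "via proxy:", viaProxy.status);
```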

10 min read

Puppeteer: click an element and get raw JSON from XHR/AJAX response

This has lately become a pretty popular question when scraping with Puppeteer: say you want to interact with the page (e.g. click a button) and retrieve the raw AJAX response (usually JSON). Why would you want to do this? It's an interesting "hybrid" approach to extracting data from a web page: while we interact with the page like a real browser, we skip the usual DOM-traversal step and grab the raw JSON server response instead…
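In Puppeteer this hybrid approach boils down to pairing page.waitForResponse() with a click. A minimal sketch; the selector and the /api/items URL fragment are hypothetical placeholders:

```ts
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com/catalog", { waitUntil: "networkidle2" });

// Register the response listener *before* clicking, so the XHR
// cannot complete in the gap between the two calls.
const responsePromise = page.waitForResponse(
  (res) => res.url().includes("/api/items") && res.status() === 200
);
await page.click("#load-more"); // interact with the page like a real user
const response = await responsePromise;

// Grab the raw JSON payload instead of traversing the rendered DOM.
const data = await response.json();
console.log(data);

await browser.close();
```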

5 min read

Puppeteer API service for web scraping

Okay, let's admit it: web scraping via Puppeteer and Playwright is the most versatile and flexible way to scrape the web nowadays. Unfortunately, it's also the most cumbersome and time-consuming, and sometimes it feels a little bit like voodoo magic. This is a post about my long and painful journey taming a real Chrome browser, controlled programmatically via Puppeteer, for my web scraping needs. First and foremost: do not use a real Chrome browser for scraping unless it is a…

10 min read

How to do web scraping in PHP

Web scraping is a big and hot topic right now, and PHP is a pretty fast language that is convenient for rapid prototyping and wildly popular among web developers. I have fairly extensive experience building complex scrapers in Node.js, but before that I spent many years actively building big projects powered by PHP (a lot of these projects are still alive, have proven to work well in the long term, and are evolving). Hopefully, this simple tutorial will be useful for you! Contents: * JS and non…

7 min read

ScrapeNinja: never handle retries and proxies in your code again

I am glad to announce that the ScrapeNinja scraping solution just received a major update and got new features: Retries. Retries are a must-have for every scraping project. Proxies fail to process your request, the target website shows captchas, and all sorts of other bad things happen every time you try to get an HTTP response. ScrapeNinja is smart enough to detect most failed responses and retry via another proxy until it gets a good response (or it fails, once the number of retries exceeds retryN…
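Conceptually, the retry-over-proxies idea looks roughly like this. A hedged sketch, not ScrapeNinja's actual server-side code: the proxy list, the captcha check, and the retryNum parameter name (completing the truncated "retryN" above) are all assumptions:

```ts
// Sketch of retrying a request through a different proxy on each failure.
import { fetch, ProxyAgent } from "undici";

// Hypothetical proxy pool.
const proxies = [
  "http://user:pass@proxy1.example.com:8080",
  "http://user:pass@proxy2.example.com:8080",
];

async function fetchWithRetries(url: string, retryNum = 3): Promise<string> {
  let lastError: unknown;
  for (let attempt = 0; attempt < retryNum; attempt++) {
    // Rotate to a different proxy on every attempt.
    const dispatcher = new ProxyAgent(proxies[attempt % proxies.length]);
    try {
      const res = await fetch(url, { dispatcher });
      const body = await res.text();
      // Treat captcha pages and non-2xx statuses as failures worth retrying.
      if (res.ok && !body.includes("captcha")) return body;
      lastError = new Error(`bad response: ${res.status}`);
    } catch (err) {
      lastError = err; // network/proxy error: try the next proxy
    }
  }
  throw lastError;
}
```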

2 min read