Puppeteer API service for web scraping

Okay, let's admit it: web scraping via Puppeteer and Playwright is the most versatile and flexible way of web scraping nowadays. Unfortunately, it is also the most cumbersome and time-consuming way of scraping, and sometimes it feels a little bit like voodoo magic. This is a post about my long and painful journey of taming a real Chrome browser, controlled programmatically via Puppeteer, for my web scraping needs.

First and foremost: do not use a real Chrome browser for scraping unless it is absolutely necessary

You should always research whether the website uses API requests returning JSON responses under the hood - this way you save yourself a lot of time and resources on the data retrieval and extraction steps. Discovering hidden APIs requires you to master the Chrome DevTools Network tab, and if you are serious about web scraping, you should make Chrome DevTools (or the Firefox console, as an alternative) your everyday tool for seeing how websites work. See my discovering hidden APIs Youtube video to get an idea of how Chrome DevTools might be used for this.

If there are no APIs, try to get the raw website response via cURL-style tools (Python requests, Go net/http, PHP cURL, Node.js node-fetch/superagent/got), and extract the data from the retrieved HTML using regular HTML parsers like cheerio, or regular expressions. Only if you have tried both of these approaches, and clearly understand why they did not work for your case, should you fall back to real headless browser scraping.
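Here is a minimal sketch of that lightweight approach, using node-fetch and cheerio (the URL and the selector are placeholders - adjust them to the actual page markup):

const fetch = require('node-fetch');
const cheerio = require('cheerio');

(async () => {
    // Fetch the raw HTML, exactly like cURL would, with no browser involved
    const res = await fetch('https://example.com');
    const html = await res.text();

    // Parse the HTML and extract data with CSS selectors
    const $ = cheerio.load(html);
    console.log($('h1').first().text());
})();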

Let me repeat it again: consider Puppeteer scraping a last resort - for when you need real human-like interaction with the page, there are no APIs, or the API responses are heavily obfuscated.

Getting started with Puppeteer is easy. Don't get too optimistic though.

Google search results are packed with "scrape with Puppeteer" types of articles. Getting started with Puppeteer invariably makes newcomers excited - wow, I can scroll and click links by invoking simple JS functions, this is so easy, it's just 5 lines of code! These lines are what all tutorials on the web show us, right? All of the conclusions and notes below apply to Puppeteer, Playwright and any other solution which lets you manage a real browser programmatically via the DevTools protocol - I will only use Puppeteer code throughout this writeup, for brevity.

const puppeteer = require('puppeteer');

(async () => {
    // Creates a headless browser instance
    const browser = await puppeteer.launch();

    // Creates a page instance, similar to opening a new tab
    const page = await browser.newPage();

    // Navigate the page to the url
    await page.goto('https://example.com');

    // Illustrative click on a button (example.com has no such input, so this line
    // will throw there - replace the selector when targeting a real page)
    await page.$eval('input[name=btn]', button => button.click());

    await page.screenshot({path: 'example.png'});

    // Closes the browser instance
    await browser.close();
})();

At the same time, the cost of entry into real web scraping via Puppeteer, and of integrating such a scraping solution into your existing code, is much, much higher than it seems. Here is a (non-exhaustive) list of issues I have encountered while working with Puppeteer for web scraping:

Issue #1: Puppeteer is slow and resource hungry

Oh boy, Puppeteer is so resource hungry! Especially compared to raw cURL or node-fetch requests, Puppeteer is a slow and heavy mammoth. If you are building a long-running Node.js process that creates Puppeteer instances, the best advice is to always close the browser instance once the job is done, and to avoid long-running sessions. Just to be safe, I would also recommend using the pm2 process manager and setting a memory limit that restarts the process.
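In practice that boils down to a try/finally around every browser instance, plus a pm2 memory cap. A rough sketch (the 500M limit and file name are arbitrary examples):

const puppeteer = require('puppeteer');

async function scrapeOnce(url) {
    // Create a fresh browser per job and close it no matter what happens
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(url);
        return await page.content();
    } finally {
        await browser.close();
    }
}

// And on the process level, let pm2 restart the worker if memory keeps growing:
// pm2 start scraper.js --max-memory-restart 500M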

Issue #2: Complex setup

Installing Puppeteer and making sure it runs smoothly is still a much more complex task than the zero maintenance of basic JS packages from npm, and even than running a whole MySQL server. While Docker images with Puppeteer make the process bearable, installation and runtime issues are still sprinkled across the Puppeteer issue queue.
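For example, running Puppeteer inside many Docker or minimal Linux environments typically needs extra Chromium flags. This is a common workaround, not an official recommendation, and it does weaken sandboxing:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        // Chromium's sandbox often fails inside containers without extra privileges;
        // these flags disable it, which is a trade-off, not a best practice
        args: ['--no-sandbox', '--disable-setuid-sandbox'],
    });
    const page = await browser.newPage();
    await page.goto('https://example.com');
    await browser.close();
})();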

Issue #3: Timeouts and exceptions

Scripted scenarios in real browsers are super flaky, and this is especially true for real-browser web scraping through a non-datacenter proxy with not-so-perfect connection speed. Literally every line of Puppeteer JS code can (AND WILL) time out and throw an exception. It is very easy to wrap the whole Puppeteer scraping logic into a try {} catch {} block, but then you need to figure out what to do when these exceptions are thrown. Do you really need to retry the whole scenario? Would you prefer to retry just one goto() statement? Is your proxy failing permanently on a particular URL? Maybe you need to switch to another proxy? Or... maybe the target website has updated their HTML layout, so your waitForSelector is now failing because a particular CSS class no longer exists on the page? These questions come up often, and unfortunately they are not always easy to reason about when the code is running on a remote server.

By the way, if you want to handle Puppeteer timeout errors specifically, you can use this kind of check in your catch block:

try {
  await page.waitForSelector('.foo');
} catch (e) {
  if (e instanceof puppeteer.errors.TimeoutError) {
    // Do something if this is a timeout.
  }
}
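And when you decide that a single flaky statement deserves its own retries, rather than restarting the whole scenario, a tiny retry helper goes a long way. A minimal sketch (the helper name and attempt count are arbitrary):

// Re-runs one flaky step a few times before giving up, so a single slow
// goto() does not force a restart of the whole scenario
async function withRetries(fn, attempts = 3) {
    let lastError;
    for (let i = 0; i < attempts; i++) {
        try {
            return await fn();
        } catch (e) {
            lastError = e;
        }
    }
    throw lastError;
}

// usage:
// await withRetries(() => page.goto('https://example.com', { timeout: 30000 }));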

Issue #4: Anti-Scraping solutions

Web scraping is a cat-and-mouse game, and if you launch Chrome via Puppeteer, set a good User-Agent, and hope to scrape some website in the wild web that is not your own homepage, you will quickly get disappointed. The usual second step, after the first captchas and blocks, is the puppeteer-extra-plugin-stealth plugin - and from there you might get deeper and deeper into the rabbit hole of looking like a true human browser..
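The stealth plugin setup itself is short; the rest of your Puppeteer code stays the same:

// puppeteer-extra is a drop-in wrapper around puppeteer that supports plugins
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');
    await browser.close();
})();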

Issue #5: Proxy rotation and timeouts

Specifying proxies with login and password auth in Puppeteer still feels somewhat hacky. You need to specify the proxy in the browser instance launch args, but the login and password for the HTTP proxy need to be supplied in the page context:

const browser = await puppeteer.launch({
    args: [
        '--proxy-server=YOUR-PROXY-SERVER:PORT'
    ],
});

and then, later in another part of your code, once the `page` object has been created:

await page.authenticate({
    username: 'USER-NAME',
    password: 'PASSWORD',
});

Retrying the request in case the proxy is down requires a brand new Chrome instance and a new page.authenticate() call. Nothing too awful, but it adds up..
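Put together, that retry flow looks roughly like this (a sketch: the shape of the proxies array and the rotation strategy are hypothetical placeholders):

const puppeteer = require('puppeteer');

// Every attempt gets a fresh browser and, if you rotate, a fresh proxy
async function scrapeViaProxies(url, proxies, attempts = 3) {
    for (let i = 0; i < attempts; i++) {
        const proxy = proxies[i % proxies.length];
        const browser = await puppeteer.launch({
            args: [`--proxy-server=${proxy.host}:${proxy.port}`],
        });
        try {
            const page = await browser.newPage();
            await page.authenticate({ username: proxy.username, password: proxy.password });
            await page.goto(url, { timeout: 30000 });
            return await page.content();
        } catch (e) {
            // proxy or navigation failed: fall through and retry with a new browser
        } finally {
            await browser.close();
        }
    }
    throw new Error(`all ${attempts} attempts failed for ${url}`);
}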

Issue #6: Debugging Puppeteer is painful

While the DevTools protocol has a debugger included, I still constantly find myself writing HTML and screenshot dumping logic into almost all of my basic scrapers to understand what is going on, especially when I am working on a remote server - for short-lived browser requests it is just more manageable than using a local Chrome instance (I don't know, maybe it's just me). Anyway, here is how you activate the remote debugger in Puppeteer:

const browser = await puppeteer.launch({
  args: ['--remote-debugging-port=9222'],
});

Now you can connect to the running Puppeteer process from your local Chrome browser (just do not forget to slow the Puppeteer session down if you want to see how your script works, or to set breakpoints in the right places by putting a debugger; statement there): open the chrome://inspect page and you should see the active browser process running there.
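And here is the kind of HTML-and-screenshot dumping helper I mentioned above, as a minimal sketch (the file naming is arbitrary):

const fs = require('fs/promises');

// Dump the current HTML and a full-page screenshot to disk, so you can see
// exactly what the (possibly remote) browser saw at this point of the scenario
async function dumpPageState(page, label = 'debug') {
    await fs.writeFile(`${label}.html`, await page.content());
    await page.screenshot({ path: `${label}.png`, fullPage: true });
}

// usage: await dumpPageState(page, 'after-login');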

Issue #7: Puppeteer does not play well with Python

Puppeteer is a Node.js package written in JavaScript. Python developers are usually locked into the much more rigid Selenium, which was designed for UI testing tasks and is simply not a good fit for real-world web scraping.

Debugging the remote Puppeteer in VS Code

Since I use VS Code on the remote server - my Node.js script with Puppeteer runs there on Ubuntu, while my Chrome is launched on my MacBook - I make sure that the port of the remote debug process (9222) is mapped to a local port on localhost; otherwise the local Chrome instance won't see the Puppeteer process running somewhere on the remote machine. Other than that, the workflow is exactly the same as with a local Puppeteer instance.

Port Forwarding in VS Code Remote. Just nearby the Terminal pane.

My Solution: Puppeteer as a REST API.

In my own experience, Puppeteer scraping code, wrapped in try/catches and retries, invariably turns my codebase into a mess. Even if I do my best and try to extract the Puppeteer code into a separate clean class, my abstractions eventually leak into the business logic. I also spend much more time than I want every time I do this for a quick script. This discomfort, and the fact that the Puppeteer setup process is suboptimal, makes creating prototypes and quick demo scrapers too time-consuming. Nothing too awful - all these technicalities are bearable, especially for a bigger project - but after catching myself a number of times preferring not to build a simple scraper just because running Puppeteer felt like a pain and an overkill, I started thinking that the same thoughts might be in the heads of other developers who just want to get the job done.

My SaaS, ScrapeNinja, originally specialized in high-performance scraping without JS evaluation, and I knew how painful Puppeteer can be when you need to run it reliably in production, so I was hesitant to implement this feature. But so many ScrapeNinja customers asked for JS evaluation, and I also felt I would use it myself, so I finally integrated a real Chrome browser into the ScrapeNinja scraping API.

Puppeteer API for Javascript and Python developers

Essentially, ScrapeNinja's new /scrape-js endpoint is a convenient way to start using Puppeteer for web scraping and get real JavaScript and DOM rendering without suffering through the Puppeteer setup!

Implementation details

If you take a look at the https://rapidapi.com/restyler/api/scrapeninja API surface, you will notice that there are now two endpoints: /scrape and /scrape-js. The first one does good old high-performance scraping via a low-level cURL-like utility which emulates Chrome's TLS fingerprint, and the new /scrape-js endpoint does exactly what you might think it does: it launches a real Chrome browser, without Puppeteer fingerprints and traces. I have tried to make the new endpoint behave similarly to the old one, though some of the optional scraping parameters still differ. The new /scrape-js endpoint is approximately 3-4x slower and 10x more resource hungry, but this should not concern you as an end customer of ScrapeNinja - managing Chrome instances is now my headache :)

Under the hood, the /scrape and /scrape-js API endpoints are routed to two separate Node.js micro-services via an nginx reverse proxy, so eventually, when more API requests start hitting /scrape-js, it can easily be load-balanced to multiple dedicated servers. The routing is done via nginx - and in case you are curious whether that is hard to implement: no, it's absolutely not! Here is the nginx config which routes the two endpoints to different Node.js services on the same machine (currently):

# put this to /sites-available/default
server {
	listen 8080 default_server;
	listen [::]:8080 default_server;

	proxy_http_version 1.1;
	proxy_cache_bypass $http_upgrade;
	proxy_set_header Upgrade $http_upgrade;
	proxy_set_header Connection 'upgrade';
	proxy_set_header Host $host;
	proxy_set_header X-Real-IP $remote_addr;
	proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
	proxy_set_header X-Forwarded-Proto $scheme;

	location / {
		proxy_pass http://localhost:8021;
	}

	location /scrape-js {
		# puppeteer can be slow! increase the timeout to avoid 504 errors
		proxy_read_timeout 120s;

		proxy_pass http://localhost:4333;
	}
}

Hitting the sweet spot between Puppeteer flexibility and manageable code base

This was the hardest part. There are two big groups of competitors on the hosted-real-browser market now:

  1. Very flexible solutions to run unmodified Puppeteer code, for example https://www.browserless.io/
  2. Very dumbed-down scraping solutions which execute real browsers under the hood, but are not good for page interaction.

I wanted something in between. I want to be able to quickly get a page's output without writing a lot of async functions and without sending real JS code to the API. At the same time, I want to be able to interact with page elements (click, modify inputs, submit forms like a human) relatively easily when I need to. Here is what the /scrape-js syntax looks like now:

const fetch = require('node-fetch');

fetch('https://scrapeninja.p.rapidapi.com/scrape-js', {
  method: 'POST',
  headers: {
    'content-type': 'application/json',
    'X-RapidAPI-Key': 'YOUR-KEY',
    'X-RapidAPI-Host': 'scrapeninja.p.rapidapi.com'
  },
  // the payload is built as a JS object and serialized, so it stays valid JSON
  body: JSON.stringify({
    url: 'https://www.wikipedia.org/',
    steps: [
      {
        type: 'change',
        value: 'Finland',
        selector: '#searchInput'
      },
      {
        type: 'click',
        selectors: [
          '#search-form button[type=submit]'
        ],
        waitForNavigation: true
      },
      {
        type: 'waitForElement',
        selectors: [
          '#firstHeading'
        ]
      }
    ],
    retryNum: 1,
    geo: 'us'
  })
})
.then(response => response.json())
.then(data => {
  console.log(data);
})
.catch(err => {
  console.error(err);
});

So, the .steps property accepts an array of operations which the remote Chrome browser will execute. More examples and better docs on what you can do via steps operations will come soon!

Here is a sample JSON response from the ScrapeNinja JS server:

{
    "info": {
        "js": true,
        "attemptsNum": 1,
        "statusCode": 200,
        "screenshot": "https://cdn.scrapeninja.net:2096/2022-07-01T12:00:58.238Z-64d8a34b0bd9069d.png",
        "headers": [
            "date: Thu, 30 Jun 2022 21:56:52 GMT",
            "server: mw1367.eqiad.wmnet",
            "last-modified: Thu, 30 Jun 2022 21:46:24 GMT",
            "cache-control: s-maxage=86400, must-revalidate, max-age=3600",
            "content-type: text/html",
            "content-encoding: gzip",
            "vary: Accept-Encoding",
            "accept-ranges: bytes",
            "content-length: 17982"
        ],
        "pageCookies": "enwikimwuser-sessionId=727e6eb2b4333333; enwikiwmE-sessionTickTickCount=1; WMF-Last-Access=01-Jul-2022; GeoIP=US:::37.44:-33.82:v4; enwikiwmE-sessionTickLastTickTime=1656676858050; WMF-Last-Access-Global=01-Jul-2022",
        "errStack": []
    },
    "body": "<html class=\"client-js ve-available\" lang=\"en\" dir=\"ltr\"><head>\n<meta charset=\"UTF-8\">\n<title>Finland - Wikipedia</title>\n<script>document.documentElement.className=\"client-js\";RLCONF={\"wgBreakFrames\":false,\"wgSeparatorTransformTable\":[\"\",\"\"],\"wgDigitTransformTable\":[\"\",\"\"],\"wgDefaultDateFormat\":\"dmy\",\"wgMonthNames\":[\"\",\"January\",\"February\",\"March\",\"April\",\"May\",\"June\",\"July\",\"August\",\"September\",\"October\",\"November\",\"December\"],</a></body></html>"
}

Pricing

For now, ScrapeNinja counts one /scrape-js call as 5 regular /scrape calls, so with the 15 USD PRO plan on RapidAPI you basically get 10,000 Puppeteer calls, with each call including rotating proxies (multiple geos available), multiple retries (so up to 3 real browser instance creations in a single API call), and interaction scenarios via the steps syntax described above. Or you get 50,000 raw ScrapeNinja calls instead (and I definitely recommend trying whether /scrape suits your web scraping needs before digging deep into /scrape-js!).

Launching Chrome and testing scraping without leaving your browser

This is another new feature of ScrapeNinja: an online sandbox for quick scraping debugging: https://scrapeninja.net/scraper-sandbox

I am a fan of the low-code and no-code approach, and this sandbox UI was created to generate a basic, but bulletproof, ScrapeNinja-backed Node.js scraper without actually writing a single line of code. Yay!

It is a pleasure to test the new JS evaluation endpoint in the browser, especially because the ScrapeNinja sandbox shows screenshots for each request, which significantly reduces friction when debugging a scraper.

ScrapeNinja Chrome as REST is still in beta  

ScrapeNinja's new endpoint is just scratching the surface of the huge pool of Chrome scraping use cases. There is still a lot you can't do with the /scrape-js endpoint: for example, you can't easily keep long sessions, scroll, or click around freely. This is something I plan to introduce if I get customer requests for it.

There are other IaaS ("infrastructure as a service") solutions on the web which specialize in hosting generic Chrome browser instances, but this is not what I wanted to build. ScrapeNinja JS evaluation is an intentionally "dumbed down" version of Chrome, so that the syntax stays concise enough to be used as a REST API. Please try it for your next scraping project and let me know about your experience!

https://scrapeninja.net/

https://rapidapi.com/restyler/api/scrapeninja/

UPDATE (SEP 2022)

With the help of my wonderful ScrapeNinja customers, I think I have finally polished the Puppeteer API to a state where I can call it stable and ready for production use - and of course, let me know if you encounter any bugs!