Modern web scraping with Playwright: choosing between Python and Node.js

When diving into the world of automated browser testing and scraping with Playwright, one of the first decisions you'll encounter is the choice of programming language. Playwright is not a one-language wonder; it caters to a polyglot audience. Let's see how the Node.js and Python versions of Playwright compare.

A bit of history

Playwright was created by Andrey Lushnikov, one of the authors of Puppeteer (he was part of the Chrome DevTools team back then). Playwright was built on the lessons of Puppeteer: it was cross-browser from the start (while Puppeteer got experimental Firefox support only recently, in 2023), it had a cleaner syntax, and it offered a lot of higher-level tooling, e.g. a test runner.

Node.js or Python?

Node.js: the native habitat

Playwright was born in the Node.js ecosystem, making it a natural habitat for this tool. If you're scaling up, Node.js shines particularly brightly. Why, you ask? It boils down to process management.

Node.js's version of Playwright doesn't spawn a new node process for every browser instance. This is crucial when you're managing multiple browser instances simultaneously. If you've got a system running various scripts at unpredictable intervals, you definitely want to avoid the overhead of spinning up a new node process each time.
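
Here is a minimal sketch of what this means in practice: one node process, one browser, and several isolated browser contexts working in parallel (example.com is just a placeholder URL).

const { chromium } = require('playwright');

(async () => {
  // A single Node.js process drives one browser...
  const browser = await chromium.launch();

  // ...and several isolated contexts (think separate incognito profiles)
  const contexts = await Promise.all([
    browser.newContext(),
    browser.newContext(),
    browser.newContext(),
  ]);

  await Promise.all(contexts.map(async (context, i) => {
    const page = await context.newPage();
    await page.goto('https://example.com'); // placeholder URL
    console.log(`context ${i}:`, await page.title());
  }));

  await browser.close();
})();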

I stumbled upon a GitHub issue that put this into perspective. A developer was wrestling with the Python Playwright implementation, which, under the hood, was spinning up a separate node process for each instance. The result? CPU and memory usage spikes.

Here is another GitHub issue where Playwright maintainers recommend Node.js for heavy lifting: [Question]: Performance benchmarking of python playwright versus node #1289 and another one.

Python: simplicity meets elegance

Python's implementation of Playwright, while elegant and simple for scripting, may not be the best companion for heavy lifting. Each call to sync_playwright() in Python fires up a new node process. Although Python is a delight for quick scripts and data analysis, this behavior might bog down your system when you're trying to keep it light on its feet.

Other Languages: a world of choices

It's worth noting that Playwright isn't just a two-player game. It extends its reach to other languages like Java and C#. If you're already working within these ecosystems, it makes sense to stick to your guns and use what you're comfortable with.

My recommendation: Node.js for scalability

For heavy-duty scraping and testing, especially at scale, Node.js is the recommended route. It keeps your CPU from sweating and your memory from overflowing. Moreover, since Node.js is Playwright's native environment, you get first-class support and performance.

Code example: blocking image download

Let's get our hands dirty and take a look at how basic code looks in the Node.js and Python versions of Playwright. Here's how you can block images when opening a page in each of them.

Playwright Node.js version  

const { chromium } = require('playwright'); // or 'firefox' or 'webkit'

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Block images
  await page.route('**/*.{png,jpg,jpeg}', route => route.abort());

  await page.goto('https://example.com');
  // ... you can perform actions on the page here

  await browser.close();
})();

In this Node.js example, we use Playwright's route method to intercept network requests and abort any that are for image resources.
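
Note that the glob pattern above only matches URLs that end in an image extension. As a rough sketch of an alternative (reusing the same page object from the snippet above), you can filter on the request's resource type instead, which also catches images served from extension-less URLs:

// Alternative: block anything the browser classifies as an image
await page.route('**/*', route => {
  if (route.request().resourceType() === 'image') {
    return route.abort();
  }
  return route.continue();
});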

Playwright Python version

For the Python version, you will need to have the Playwright Python package installed and the Playwright browser binaries downloaded. This can be done by running pip install playwright followed by playwright install.

import asyncio
from playwright.async_api import async_playwright

async def block_images_and_open_page():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Block images
        await page.route('**/*.{png,jpg,jpeg}', lambda route: route.abort())

        await page.goto('https://example.com')
        # ... you can perform actions on the page here

        await browser.close()

asyncio.run(block_images_and_open_page())

In this Python async example, we use the async_playwright context manager to handle the Playwright object lifecycle and route method to intercept and abort image requests.

Both snippets demonstrate how to block image loading, which can be particularly useful when scraping websites where images are not needed, thus speeding up page load times and reducing bandwidth usage.

Playwright for web scraping: fingerprinting and stealth modes

When we talk about using Playwright for real-world web scraping, things get interesting, especially in the context of comparing Node.js with Python.

Official Playwright packages simply do not aim to remove all traces of automation from the browser: the primary goal of a programmatically controlled browser has always been UI testing, not web scraping.

Most big websites implement basic or sophisticated anti-scraping measures which try to fingerprint your web browser and block it if it looks like an automated one (so stock versions of Playwright are blocked by most modern websites that have any kind of anti-scraping protection, even the most basic one). There are huge companies whose whole business is detecting scrapers on multiple levels.
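
To make this concrete, the most basic signal an anti-bot script can check is navigator.webdriver, which a stock Playwright-controlled Chromium typically reports as true. Here is a minimal sketch (the exact behavior can vary with browser and launch flags):

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder URL

  // A stock automated browser usually exposes this flag,
  // and it is one of the first things anti-bot scripts look at
  const isAutomated = await page.evaluate(() => navigator.webdriver);
  console.log('navigator.webdriver:', isAutomated);

  await browser.close();
})();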

For Node.js, there exists a whole ecosystem of stealth plugins, and you can use these stealth improvements to make your Playwright instance look more like a regular human browser:

Fingerprint Suite from Apify

https://github.com/apify/fingerprint-suite

Playwright Extra

This is basically an interoperability layer with Puppeteer stealth packages.

https://www.npmjs.com/package/playwright-extra
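
As a minimal sketch of what the playwright-extra route looks like (based on the package's documented usage; you also need puppeteer-extra-plugin-stealth installed from npm):

// npm install playwright playwright-extra puppeteer-extra-plugin-stealth
const { chromium } = require('playwright-extra');
const stealth = require('puppeteer-extra-plugin-stealth')();

// Register the stealth plugin before launching the browser
chromium.use(stealth);

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // ... scrape as usual, with many common automation leaks patched
  await browser.close();
})();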

These two sets of tools are not a silver bullet and do not guarantee a successful scraping process; for example, the Apify suite and Playwright Extra perform much worse than ScrapeNinja /scrape-js when it comes to evading detection, but at least they are trying to help you in this regard.

Of course, Python developers want similar solutions for Playwright Python – here is a GitHub issue about it, which was closed without any good resolution.

If you still need a stealthy Python browser, your best bet is Undetected Chromedriver (it is built on top of Selenium with a heavily patched ChromeDriver).

Conclusion

Choosing the right tool for the job is as important as the job itself. In the world of Playwright, Node.js offers the best scalability, stealth tooling, and performance, especially for complex, resource-intensive tasks like web scraping. Python may seduce you with its syntax and simplicity, but when it comes to Playwright, the Node.js version holds the upper hand for scaling. Python and other language bindings have their place, but for heavy-duty browser automation, Node.js is the way to go. But... if you are comfortable with Python and new to the Node.js ecosystem, I would definitely recommend sticking to the tool you know until you hit real performance issues.

If you liked this post, you will probably enjoy my blog post where I used low code to scrape Apple.com for refurbished iPhones and get push alerts, and another one on how I built a specialized browser API for web scraping.