How to set proxy in Puppeteer: 3 ways

Puppeteer is an incredibly useful tool for automating web browsers. It allows you to run headless (or non-headless) Chrome instances and interact with websites and pages automatically, in ways that would normally require manual input from a user or other scripts. In many cases (particularly in web scraping tasks) HTTP requests need to look like they originate from different IPs or networks than the server running Puppeteer – and this is where proxies come into play. In this blog post I will go over three ways to set proxy settings in Puppeteer.

1. Important thing to know about GitHub packages for proxying Puppeteer requests

There are a number of GitHub packages available which claim to provide a convenient way to specify a proxy for Puppeteer, e.g. https://github.com/Cuadrix/puppeteer-page-proxy and https://github.com/gajus/puppeteer-proxy – but looking into the source code, it becomes apparent that they use a pretty hacky approach: they intercept every Puppeteer request and copy all of its data to another, fundamentally different, package which performs the actual HTTP request. This means the proxied request won't have Chrome's TLS fingerprint – it will have a Node.js fingerprint instead, which can lead to bad results in a lot of web scraping cases.

Let me put it this way: these two GitHub packages are not setting a proxy for Puppeteer. They are essentially setting a proxy for another network library and copying HTTP request data back and forth, for every HTTP request, between Puppeteer and that network library. All sorts of interesting things, like converting cookies from the raw HTTP set-cookie format to the Puppeteer format, now need to be handled at the proxy package level, which definitely does not look good to me. The gajus/puppeteer-proxy README states this explicitly: "puppeteer-proxy intercepts requests after it receives the request metadata from Puppeteer. puppeteer-proxy uses Node.js to make the HTTP requests. The response is then returned to the browser. When using puppeteer-proxy, browser never makes outbound HTTP requests."
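To make the trade-off concrete, here is a minimal sketch of the interception approach these packages are built around – not a recommendation, and the real packages handle many more edge cases. I am assuming got v11 (which is still CommonJS), the http-proxy-agent / https-proxy-agent packages, and a placeholder proxy address:

const puppeteer = require('puppeteer');
const got = require('got');
const { HttpProxyAgent } = require('http-proxy-agent');
const { HttpsProxyAgent } = require('https-proxy-agent');

const proxy = 'http://your-proxy-host:3128'; // placeholder

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setRequestInterception(true);

  page.on('request', async (request) => {
    try {
      // the request leaves via Node.js networking, not via Chrome:
      const response = await got(request.url(), {
        method: request.method(),
        headers: request.headers(),
        body: request.postData(),
        agent: { http: new HttpProxyAgent(proxy), https: new HttpsProxyAgent(proxy) },
        responseType: 'buffer',
        throwHttpErrors: false,
      });
      // the Node.js response is copied back into the browser;
      // got has already decompressed the body, so encoding headers must go
      const headers = { ...response.headers };
      delete headers['content-encoding'];
      delete headers['content-length'];
      await request.respond({
        status: response.statusCode,
        headers,
        body: response.body,
      });
    } catch (err) {
      await request.abort();
    }
  });

  await page.goto('https://example.com');
  await browser.close();
})();

Every request now carries the TLS and HTTP fingerprint of the Node.js stack, not of Chrome – which is exactly the problem.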

2. Set proxy via launch arguments of Puppeteer

Luckily, Puppeteer has a native option to set a proxy via launch args:

const puppeteer = require('puppeteer');

async function run() {
  const browser = await puppeteer.launch({
    headless: false,
    // all pages of this browser instance will go through the proxy
    args: [ '--proxy-server=http://your-proxy-host:3128' ]
  });
  const page = await browser.newPage();

  const pageUrl = 'https://whatismyipaddress.com/';

  await page.goto(pageUrl);
}

run();

That's it! All requests of this Puppeteer instance are now routed through the HTTP proxy specified in the launch arguments.

It is important to understand that the proxy in this case is set for the whole Puppeteer instance, and not on a page-by-page basis.

I recommend checking what IP address the target website ( https://whatismyipaddress.com/ ) shows you, to make sure the proxy works properly. The easiest way to do this is to ask Puppeteer to take a screenshot:

// put this after the page.goto() call
await page.screenshot({ path: '/path/screenshot.jpg' });

and check that the screenshot shows the IP address of your proxy instead of the IP of the server where Puppeteer is running.
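Another quick way to verify, if you prefer text over screenshots, is to load a plain-text IP echo service and read the response body directly. Here I am using https://api.ipify.org as an example endpoint:

// put this after the browser has been launched with the proxy args
const ipPage = await browser.newPage();
await ipPage.goto('https://api.ipify.org');
const shownIp = await ipPage.evaluate(() => document.body.innerText.trim());
console.log('exit IP as seen by websites:', shownIp);
await ipPage.close();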

Unfortunately, this approach to setting a proxy is a bit limited: you can't pass the login and password of the proxy in the --proxy-server argument... but read on!

How to set a proxy which requires authentication

A lot of proxies use basic HTTP authentication, and in this case the proxy URL might look like this: http://username:password@proxy-host:port – and it turns out we can't just pass the username and password to Puppeteer via launch args!

Don't worry, here is how it's done: once you have created the page object, just call its .authenticate() method:

const puppeteer = require('puppeteer');

async function run() {
  const browser = await puppeteer.launch({
    headless: false,
    args: [ '--proxy-server=http://your-proxy-host:3128' ]
  });
  const page = await browser.newPage();

  // authenticate against the proxy using basic browser auth
  await page.authenticate({ username: 'user', password: 'password' });

  const pageUrl = 'https://whatismyipaddress.com/';

  await page.goto(pageUrl);
}

run();

The only thing I don't like about this approach is that in my code, the browser object and the page object are always initialized and then used in different functions, which means the proxy logic is now smeared in a thin layer across my source code.
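One way to keep that logic in a single place is a small helper which parses a proxy URL with embedded credentials and wires up both the launch args and the authentication. This is just a sketch: launchWithProxy and newPageWithProxy are hypothetical helper names, not Puppeteer APIs.

const puppeteer = require('puppeteer');

// hypothetical helper: accepts a full proxy URL,
// e.g. http://user:password@your-proxy-host:3128
async function launchWithProxy(proxyUrlString) {
  const proxyUrl = new URL(proxyUrlString);
  const browser = await puppeteer.launch({
    // proxyUrl.host includes the port; credentials are stripped here
    args: [`--proxy-server=${proxyUrl.protocol}//${proxyUrl.host}`],
  });
  return { browser, proxyUrl };
}

// hypothetical helper: every page created through it is pre-authenticated
async function newPageWithProxy(browser, proxyUrl) {
  const page = await browser.newPage();
  await page.authenticate({
    username: decodeURIComponent(proxyUrl.username),
    password: decodeURIComponent(proxyUrl.password),
  });
  return page;
}

(async () => {
  const { browser, proxyUrl } = await launchWithProxy('http://user:password@your-proxy-host:3128');
  const page = await newPageWithProxy(browser, proxyUrl);
  await page.goto('https://whatismyipaddress.com/');
  await browser.close();
})();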

Native Puppeteer way to set different proxies for different sets of pages, in one browser instance

The native way to set a proxy via Chrome launch args has one important limitation: if you want to use different proxies for different conditions (e.g. different target websites), you need a separate Chrome instance for each proxy. If you don't want to use any additional npm packages to set a proxy for Puppeteer, and you also don't want to spend your server resources on launching a lot of Chrome instances (one Chrome instance per proxy), you still have a way to set different proxies for different pages – use browser contexts:

// Create a new incognito browser context with its own proxy
const context = await browser.createIncognitoBrowserContext({ proxyServer: 'http://localhost:2022' });
// Create a new page inside the context.
const page = await context.newPage();

// authenticate against the proxy using basic browser auth
await page.authenticate({ username: 'user', password: 'password' });
// ... do stuff with the page ...
await page.goto('https://example.com');
// Dispose of the context once it's no longer needed.
await context.close();

The createIncognitoBrowserContext() function accepts a BrowserContextOptions object, whose proxyServer property lets you specify the proxy address:

https://pptr.dev/api/puppeteer.browser.createincognitobrowsercontext/

https://pptr.dev/api/puppeteer.browsercontextoptions
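To illustrate what this enables, here is a sketch of two sets of pages going through two different proxies inside a single Chrome instance (the proxy addresses are placeholders; add page.authenticate() calls if your proxies require auth):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();

  // each context gets its own proxy, but they share one Chrome process
  const usContext = await browser.createIncognitoBrowserContext({
    proxyServer: 'http://us-proxy-host:3128',
  });
  const euContext = await browser.createIncognitoBrowserContext({
    proxyServer: 'http://eu-proxy-host:3128',
  });

  const usPage = await usContext.newPage();
  const euPage = await euContext.newPage();

  await Promise.all([
    usPage.goto('https://example.com'),
    euPage.goto('https://example.org'),
  ]);

  await usContext.close();
  await euContext.close();
  await browser.close();
})();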

3. Proxy-chain. My preferred method to set proxy for Puppeteer

Now, the wonderful https://www.npmjs.com/package/proxy-chain package comes into play! The concept of this package looks a bit complicated at first glance, but it is very nice: it launches an intermediate proxy server on your local machine, your Puppeteer instance connects to this local proxy, and it in turn forwards the requests to the external proxy. This approach solves a lot of issues and provides new features:

  • the TLS fingerprint of the HTTP request is the one of Chrome run by Puppeteer, not of a random Node.js networking package
  • there is no need to use page.authenticate() anymore: Puppeteer connects to "http://localhost:3233" (or whatever port proxy-chain listens on), and the authentication against the external proxy is done inside the proxy-chain package
  • you get nice stats for all outgoing Puppeteer connections and can do awesome things there
  • it is possible to route different target websites through different external proxies (see the routing sketch at the end of this section)

Here is a basic working example which leverages the anonymizeProxy() method of the proxy-chain package:

const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async() => {
    const oldProxyUrl = 'http://bob:password123@proxy.example.com:8000';
    const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);

    // Prints something like "http://127.0.0.1:45678"
    console.log(newProxyUrl);

    const browser = await puppeteer.launch({
        args: [`--proxy-server=${newProxyUrl}`],
    });

    // Do your magic here...
    const page = await browser.newPage();
    await page.goto('https://www.example.com');
    await page.screenshot({ path: 'example.png' });
    await browser.close();

    // Clean up
    await proxyChain.closeAnonymizedProxy(newProxyUrl, true);
})();

If you want to do more awesome proxy-related things, like measuring stats for incoming/outgoing HTTP traffic, implementing your own proxy auth for your customers, or using multiple external proxies and load balancing between them, please review the proxy-chain docs on GitHub.
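As one example, here is a sketch of routing different target hosts through different upstream proxies, using proxy-chain's Server class and its prepareRequestFunction hook; the upstream proxy URLs and the routing rule are placeholders:

const puppeteer = require('puppeteer');
const { Server } = require('proxy-chain');

(async () => {
  const server = new Server({
    port: 8000,
    prepareRequestFunction: ({ hostname }) => {
      // pick an upstream proxy based on the target host
      const upstreamProxyUrl = hostname.endsWith('example.com')
        ? 'http://bob:password123@proxy-a.example.com:8000'
        : 'http://bob:password123@proxy-b.example.com:8000';
      return { upstreamProxyUrl };
    },
  });
  await server.listen();

  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://127.0.0.1:8000'],
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
  await server.close(true);
})();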

I use the proxy-chain package in ScrapeNinja, my web scraping API. So if you don't want to deal with all this proxy complexity and running your own instance of Puppeteer, and just want to get the final HTML of any website, give ScrapeNinja a try with whatever custom proxy you need (yes, you can pass your own custom proxy to ScrapeNinja), or use one of ScrapeNinja's proxy clusters with rotating IPs.

Read how I built ScrapeNinja Puppeteer API

Unblockable web scraping in Node.js

Web scraping in Node.js is cumbersome: all these timeouts, retries, proxy rotation, bypassing website protections, running Puppeteer and deciding whether you really need Puppeteer's overhead in each particular case – and extracting data from HTML is a huge pain. After spending months of my life building JavaScript web scrapers, I ended up re-using the same patterns and copy&pasting my web scraping code again and again, and then I realized this might be useful for other people as well. I built and bootstrapped ScrapeNinja.net, which not only finally made my life a lot easier, but also helped hundreds of fellow developers to extract web data! I now build and test all my web scrapers in the browser, spending minutes instead of hours, and then I just copy&paste the generated JavaScript code of the web scraper from the ScrapeNinja website to my Node.js server so it can safely run in production. Try ScrapeNinja in the browser and let me know if it helps you, too.