How to download a PDF in Playwright
In web scraping, I often run into hurdles that call for creative solutions and quick code workarounds, and this is especially true when I work with programmatically driven browsers, which I have been doing a lot lately. Today, I'd like to share a challenge I faced while trying to download PDF files using Playwright, and how I managed to overcome it.
The Unexpected Twist with Chromium and Playwright
Initially, after quickly browsing the Playwright docs section on downloading files, I thought downloading a PDF with Playwright would be straightforward. However, I ran into a peculiar behavior of the Chromium engine: instead of downloading the PDF, it opened it in a new tab. This was a curveball I hadn't anticipated.
Why Playwright?
I chose Playwright for its robustness in automating and controlling different browsers. Its ability to handle complex interactions made it an ideal candidate for this task. I mostly used Puppeteer in 2021-2022 (Playwright for Node.js and Puppeteer are similar solutions to the same problem: controlling browsers programmatically, over the Chrome DevTools Protocol in the case of Chromium), but I am now switching to Playwright because it has more syntactic sugar and smart workarounds built into its core, which makes my Node.js browser code more concise and easier to read.
The Challenge of Downloading the PDF
Navigating to the PDF's URL and expecting a download prompt was my first step. But, as I quickly learned, things were not that simple. The PDF wasn't directly accessible; it required specific user interactions to be generated.
Simulating User Interactions
Using Playwright, I scripted the necessary user interactions, like button clicks and form submissions. Playwright's page.waitForResponse() was particularly useful in handling AJAX calls that were part of this process.
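As a rough illustration of that pattern, here is a minimal sketch of clicking something and waiting for the matching AJAX response. The helper name clickAndWaitForResponse, the selector, and the URL fragment are all hypothetical placeholders, not something from the site I was scraping; the only Playwright calls it relies on are page.waitForResponse() and page.click().

```javascript
// Hypothetical helper: `page` is a Playwright Page, while `selector` and
// `urlFragment` are placeholders you would replace with real values.
async function clickAndWaitForResponse(page, selector, urlFragment) {
  // Start waiting BEFORE clicking, so a fast response is not missed.
  const [response] = await Promise.all([
    page.waitForResponse(
      (res) => res.url().includes(urlFragment) && res.status() === 200
    ),
    page.click(selector),
  ]);
  return response;
}
```

A call might look like `await clickAndWaitForResponse(page, '#generate-report', '/api/report')`. The important design choice is that the wait is registered before the click; awaiting the click first risks missing a response that arrives quickly.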
Addressing the Chromium Quirk
The real challenge was dealing with Chromium's tendency to open PDFs in a new tab. To solve this, I used Playwright's routing capabilities to intercept the PDF request and modify the response headers. Here’s the fix:
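await page.route('**/file.pdf', async route => {
  // Fetch the real response; fetch() accepts a Request object, while get() expects a URL string
  const response = await page.context().request.fetch(route.request());
  await route.fulfill({
    response,
    headers: {
      ...response.headers(),
      // Lowercase key, so it replaces an existing content-disposition header instead of duplicating it
      'content-disposition': 'attachment',
    }
  });
});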
This code intercepts the request for the PDF and modifies the response headers to include Content-Disposition: attachment. This forces the browser to download the file instead of displaying it.
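One detail worth spelling out: HTTP header names are case-insensitive, so a server may already send this header under a different casing, for example with the value inline. Below is a minimal sketch of a header-merging helper that drops any existing variant before adding the attachment value. The name withAttachmentDisposition and the optional filename argument are my own assumptions for illustration, not part of Playwright.

```javascript
// Hypothetical helper: returns a copy of `headers` that forces a download.
// `filename` is optional and only used as a suggested name for the file.
function withAttachmentDisposition(headers, filename) {
  const result = {};
  for (const [name, value] of Object.entries(headers)) {
    // Header names are case-insensitive, so drop any existing variant.
    if (name.toLowerCase() !== 'content-disposition') {
      result[name] = value;
    }
  }
  // Strip double quotes so the quoted filename stays well-formed.
  const safeName = filename ? String(filename).replace(/"/g, '') : '';
  result['content-disposition'] = safeName
    ? `attachment; filename="${safeName}"`
    : 'attachment';
  return result;
}
```

Inside a route handler like the one above, the result could be passed as the headers option, e.g. `headers: withAttachmentDisposition(response.headers(), 'report.pdf')`.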
Full example of PDF download in Playwright
Please use at least Node.js 16 for this code, and make sure "type": "module" is set in your `package.json` so that the imports and top-level await work.
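import { chromium } from 'playwright'; // Importing Playwright's Chromium browser type
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
// __dirname is not defined in ES modules, so derive it from import.meta.url
const __dirname = path.dirname(fileURLToPath(import.meta.url));
// Specify the directory where you want to save the PDF
const downloadPath = path.join(__dirname, 'downloads');
fs.mkdirSync(downloadPath, { recursive: true });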
// Launching the browser; downloadsPath is a launch option, not a context option
const browser = await chromium.launch({
  downloadsPath: downloadPath
});
const context = await browser.newContext({
  // Allowing downloads in this browser context
  acceptDownloads: true
});
// Opening a new page
const page = await context.newPage();
// Intercepting PDF requests and modifying the headers to force download.
// Note that we match on endsWith('.pdf') here as a loose,
// generic catch-all strategy for all PDF files
await page.route('**/*', async (route, request) => {
  if (request.resourceType() === 'document' && request.url().endsWith('.pdf')) {
    // fetch() accepts the Request object; get() expects a URL string
    const response = await page.context().request.fetch(request);
    await route.fulfill({
      response,
      headers: {
        ...response.headers(),
        'content-disposition': 'attachment',
      }
    });
  } else {
    await route.continue();
  }
});
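// Start listening for the download before triggering it, so the event is not missed
const downloadPromise = page.waitForEvent('download');
// Navigate to the page where the PDF can be downloaded.
// If this URL is itself the PDF, goto() may reject once the download starts; wrap it in try/catch in that case
await page.goto('https://example.com/path-to-pdf');
// Additional steps to trigger the PDF download, if necessary
// For example: await page.click('selector-to-download-button');
// Wait for the download to start, then save it under its suggested file name.
// Files left in downloadsPath are removed when the context closes, so copy the file out with saveAs()
const download = await downloadPromise;
const filePath = path.join(downloadPath, download.suggestedFilename());
await download.saveAs(filePath); // resolves once the download has finished
console.log(`Downloaded file from ${download.url()} to ${filePath}`);
// Close the browser
await browser.close();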
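A caveat about the endsWith('.pdf') check in the example above: it misses URLs that carry a query string or fragment after the file name, and it is case-sensitive. Here is a minimal sketch of a slightly more tolerant check that looks only at the path part of the URL. The helper name isPdfUrl is hypothetical, and it still will not catch PDFs served from URLs without a .pdf extension; for those you would need to inspect the response's content type instead.

```javascript
// Hypothetical helper: true when the URL's path (ignoring query string and
// fragment) ends with ".pdf", compared case-insensitively.
function isPdfUrl(rawUrl) {
  try {
    const { pathname } = new URL(rawUrl);
    return pathname.toLowerCase().endsWith('.pdf');
  } catch {
    // new URL() throws on strings that are not absolute URLs.
    return false;
  }
}
```

In the route handler, the condition could then become `request.resourceType() === 'document' && isPdfUrl(request.url())`.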
Wrapping Up
To my fellow developers, I hope this code helps you in your web scraping tasks. Remember, the path to a solution often requires a deep understanding of your tools and the ability to adapt to unexpected challenges.
Happy coding, and may your downloads be smooth and your code bug-free!
Also read the story of how I built a SaaS to run Puppeteer as a service, and how to bypass the Cloudflare 403 error when web scraping.