Cheerio: parse HTML in JavaScript. Playground and cheatsheet
Cheerio is now the de facto standard for parsing HTML in server-side JavaScript (Node.js). It is a fast, flexible, and lean implementation of jQuery-like syntax designed specifically for the server.
Github: https://github.com/cheeriojs/cheerio Stars: 25.8K
NPM: https://www.npmjs.com/package/cheerio
Cheerio is a pretty performant solution to extract data from raw HTML web pages, and is perfect for web scraping tasks when you don't need real browser rendering or you just don't want to use Puppeteer / Playwright.
I have already built a number of JavaScript web scrapers which parse HTML and extract JSON from it, but for me, writing and testing Cheerio code still takes a significant amount of time. The syntax of Cheerio is somewhat similar to jQuery, but it runs in Node.js: there is no "real" browser DOM, and it is not real jQuery.
I used to suffer every time I googled for "Cheerio quick examples" and "Cheerio how to iterate over element children". Not anymore.
Cheerio Online Playground
I have built an online Cheerio sandbox to quickly test Cheerio syntax against various test inputs. Think of it as regex101, but for Cheerio selectors instead of regex strings. It is already insanely helpful for me: I think it saves me 15-30 minutes on every simple scraper I write, just because I have working selector samples at hand and can quickly test my new selectors.
UPD JUL 2024: I have launched v2 of the sandbox, try it here. This version generates Cheerio.js extractors using AI.
Cheerio Cheatsheet
First of all, I recommend reading the Cheerio docs, in particular Selecting Elements: it is very useful and covers most of what you might ever need when writing good selectors.
Initializing Cheerio DOM from HTML string
const cheerio = require('cheerio');
let $ = cheerio.load('<html><body><h1 id="title">Sample title</h1></body></html>');
// extract title
console.log($('#title').text());
In all code below, $ is the Cheerio object returned by cheerio.load() for the parsed document.
Replace multiple spaces with a single space
This is useful to clean up HTML descriptions scraped from websites.
$('.descr').html().replace(/\s\s+/g, ' ').trim()
Iterate over children and return them as an array of objects
#1: functional approach with map():
let parsedItems = $('.first-list li').map(function() {
    // `this` is the current <li> element
    return {
        descr: $('.descr', this).html().replace(/\s\s+/g, ' ').trim(),
        price: $('.price', this).text().replace('$', '')
    };
}).toArray();
console.log(parsedItems);
Important: there are two ways to pass callbacks when iterating over children in Cheerio, and they work identically for map and each. First, you can pass function() and access the current element as this. Or you can pass an arrow callback, in which case you need the second argument to access the current element: (idx, elem) => { }. Arrow functions do not get their own this, so this does not point at the element there.
It is a matter of personal preference whether you use arrow functions and remember that the element is the second argument, or use this inside function().
Also important: do not forget .toArray()!
Without the toArray() call, map() gives you a Cheerio collection object instead of a plain JS array. It can't be easily converted to plain JS objects, and it can also overwhelm your console.log because it is huge and has recursive links inside.
#2: imperative approach with each():
let items = $('.first-list li');
let parsedItems = [];
items.each((i, elem) => {
    // elem is the raw node; wrap it with $() to use Cheerio methods
    parsedItems.push($(elem).text());
});
console.log(parsedItems);
3 ways to write effective and anti-fragile Cheerio selectors for obscure CSS classes
Partial CSS class name matching
Sample HTML structure:
<div>
<label class="position_label_f33223-random-3ef">Position:</label>
<div>Software Engineer</div>
</div>
The class name is partially static here, but its ending is apparently dynamic.
$('[class^=position_label_]'); // select element by class starting with "position_label_"
Selecting by text value: ":contains" selector
Let's imagine you want to extract a value from an element with a completely random class. Here is the HTML structure:
<div>
<label class="wd321213d3332_f233ef">Position:</label>
<div>Software Engineer</div>
</div>
"Software Engineer" is a value we need to extract.
Looking at the CSS class, it's pretty apparent that the class name is generated randomly by the frontend build pipeline, so the naive .wd321213d3332_f233ef CSS selector is too fragile and will break on the next website deployment.
Leveraging the :contains selector supported by Cheerio is a great approach for cases where a website has completely dynamic CSS classes on its elements:
$('label:contains("Position:") + div').text()
Selecting by data attributes
Sample HTML:
<div>
<label class="wd321213d3332_f233ef" data-test-id="position">Position:</label>
<div>Software Engineer</div>
</div>
Attributes like data-test-id or data-another-property often occur in HTML, and being able to use them in Cheerio selectors is nice, especially when class names are dynamic. Here is a good Cheerio selector for this HTML:
$('label[data-test-id="position"] + div').text()
Experiment with these 3 methods, useful for dynamic CSS classes, in Cheerio Sandbox
How to find elements without specific attributes in Cheerio?
Sample HTML:
<div class="content">This div has a class attribute</div>
<div>This div does not have a class attribute</div>
<div class="footer">This div also has a class attribute</div>
Let's find all div elements without a class attribute using the :not pseudo-class and the attribute selector:
const divsWithoutClass = $('div:not([class])');
Writing Cheerio selectors like a Pro
Getting comfortable with all the methods of accessing data through Cheerio is an important and useful skill, especially in web scraping.
Here is a sample HTML:
<html>
<body>
<table>
<tr>
<td>First row title</td>
<td>First row value</td>
</tr>
<tr>
<td>Second row title</td>
<td>Second row value</td>
</tr>
</table>
</body>
</html>
Let's imagine we need to access the "Second row value" string here, and we know that the table always has only 2 rows. How can we do that?
Here are 3 ways to do it!
1. Access element by index, using the zero-based :eq() selector
$('table tr:eq(1) td:eq(1)').text().trim()
2. Access element by functional traversing, using next()
$('table').find('td').parent().next().find('td').eq(1).text()
3. Access element by text value of header
$('td:contains("Second row title") + td').text()
If you know how to use all 3, you are ready for some real life Cheerio projects! Do you know another method? Let me know and I will be happy to add it!
Extracting JSON for Linking Data using Cheerio
JSON for Linking Data (JSON-LD) is essentially a standardized big JSON object embedded into some web pages. It is a very useful piece of information, because it is just machine-readable pure data in JSON format: easy to extract & analyze. Read more about JSON for Linking Data on its website.
let ldNode = $("script[type='application/ld+json']:first");
if (ldNode[0] && ldNode[0].children && ldNode[0].children[0]) {
    // the raw JSON is the text node inside the <script> tag
    const graph = JSON.parse(ldNode[0].children[0].data);
    console.log(graph['@type']); // access LD JSON properties
}
Switching between two parsing modes
Cheerio has 2 parser engines under the hood: parse5 (the default, which follows the HTML specification) and the forgiving htmlparser2.
Usually, you don't need to worry about which parser to use, until you start getting some weird errors when extracting data. In my case, I hit an issue with JSON-LD escaping of special characters: as far as I remember, an & (ampersand) symbol in some string inside the JSON was breaking things, so Cheerio could not extract the full JSON object. Switching to htmlparser2 made my extractor around 50-70% slower (it was still sub-200ms) but fixed the error:
const $ = cheerio.load(html, { xml: true });
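So, passing xml: true during Cheerio initialization switches the parser to htmlparser2, which was slower in my case. Note that this option also turns on XML parsing mode, so the markup is no longer treated as a browser would treat HTML.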
Why Cheerio in 2024? Puppeteer already won the scrapers battle, didn't it?
If you use Puppeteer for rendering websites, there is still room for Cheerio! For example, in the ScrapeNinja project (which provides tooling for fast and effective website scraping) I prefer to separate the rendering process from the extraction process, because Puppeteer code quickly gets messy and complex. This comes with a small performance hit, since the DOM tree of the website is processed twice: once when Puppeteer renders the website, and again when you pass all the HTML to Cheerio to build the DOM structure.
So, in ScrapeNinja, I can use Puppeteer to open the website and interact with it, and then I write and pass a so-called extractor function, which uses Cheerio to extract all the data from the HTML. This works pretty well so far!
Short tutorial where ScrapeNinja Cheerio selectors are used for web scraping, in ScrapeNinja Live Sandbox:
Bonus: interactive CSS selectors puzzles
Puzzle #1: extracting data from tables with generated CSS classes
https://scrapeninja.net/scraper-sandbox?slug=task-random-css-class
Solution: https://scrapeninja.net/scraper-sandbox?slug=task-random-css-class-solution