Cheerio: parse HTML in JavaScript. Playground and cheatsheet

Cheerio is now the de facto standard for parsing HTML in server-side JavaScript (Node.js). It is a fast, flexible, and lean implementation of jQuery-like syntax designed specifically for the server.

GitHub: https://github.com/cheeriojs/cheerio (Stars: 25.8K)

NPM: https://www.npmjs.com/package/cheerio

Cheerio is a performant way to extract data from raw HTML pages, and it is a great fit for web scraping tasks when you don't need real browser rendering or simply don't want to use Puppeteer / Playwright.

I have already built a number of JavaScript web scrapers which parse HTML and extract JSON from it, but for me, writing and testing Cheerio code still takes a significant amount of time. The syntax of Cheerio is similar to jQuery, but it still runs in Node.js: there is no "real" DOM and no real jQuery.

I suffered every time I googled "Cheerio quick examples" or "Cheerio how to iterate over element children". Not anymore.

Cheerio Online Playground

I have built an online Cheerio sandbox to quickly test Cheerio syntax against various test inputs. Think of it as regex101, but for Cheerio selectors instead of regex strings. It is already insanely helpful for me: I estimate it saves 15-30 minutes on every simple scraper I write, just because I have working selector samples at hand and can quickly test new selectors.

Try the sandbox here.

UPD JUL 2024: I have launched v2 of the sandbox, try it here. This version generates Cheerio.js extractors using AI.

Cheerio Cheatsheet

First of all, I recommend reading the Cheerio docs, in particular Selecting Elements: it is very useful and covers most of what you will need to write good selectors.

Initializing Cheerio DOM from HTML string

const cheerio = require('cheerio');

const $ = cheerio.load('<html><body><h1 id="title">Sample title</h1></body></html>');
// extract title
console.log($('#title').text()); // Sample title

In all code below, $ is the Cheerio object returned by cheerio.load().

Replace multiple spaces with a single space

This is useful for cleaning up HTML descriptions scraped from websites.

$('.descr').html().replace(/\s\s+/g, ' ').trim()
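A minimal sketch of the same cleanup wrapped in a reusable helper. It assumes you also want to survive a selector that matches nothing, in which case .html() returns null and calling .replace() on it would throw. The name collapseWhitespace is hypothetical and not part of Cheerio. It uses /\s+/ rather than /\s\s+/ so that a lone tab or newline is also turned into a space.

```javascript
// Hypothetical helper: collapse runs of whitespace into single spaces.
// Accepts null/undefined so it can be fed the result of .html() directly.
function collapseWhitespace(value) {
  if (value == null) return '';
  return String(value).replace(/\s+/g, ' ').trim();
}

const raw = '  Nice   <b>blue</b>\n\t  widget  ';
console.log(collapseWhitespace(raw));  // "Nice <b>blue</b> widget"
console.log(collapseWhitespace(null)); // ""
```

With Cheerio this would be called as collapseWhitespace($('.descr').html()).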

Iterate over children and return them as an array of objects

#1: functional approach with map():

let parsedItems = $('.first-list li').map(function() {
    return {
        descr: $('.descr', this).html().replace(/\s\s+/g, ' ').trim(),
        price: $('.price', this).text().replace('$', '')
    };
}).toArray();

console.log(parsedItems);

Important: there are two ways to pass callbacks when iterating over elements in Cheerio, and they work identically for map() and each(). First, you can pass a regular function() and access the current element via this. Second, you can pass an arrow function; arrow functions do not get their own this, so in this case you need the second argument to access the current element: (idx, elem) => { }. It is a matter of personal preference whether you use arrow functions and remember that the element is the second argument, or use this inside function().

Also important: do not forget .toArray() !

Without the toArray() call, you get back a Cheerio object wrapping the results instead of a plain JS array. It is harder to work with as plain data, and it can also overwhelm your console.log output because it is huge and contains recursive links.

#2: imperative approach with each():

let items = $('.first-list li');

let parsedItems = [];
items.each((i, elem) => {
	parsedItems.push($(elem).text());
});

console.log(parsedItems);

3 ways to write effective and anti-fragile Cheerio selectors for obscure CSS classes

Partial CSS class name matching

Sample HTML structure:

<div>
    <label class="position_label_f33223-random-3ef">Position:</label>
    <div>Software Engineer</div>
</div>

The class name is partially static here, but the ending is apparently dynamic.

$('[class^=position_label_]'); // select elements whose class attribute starts with "position_label_"

Selecting by text value: ":contains" selector

Let's imagine you want to extract a value from an element with a completely random class. Here is the HTML structure:

<div>
    <label class="wd321213d3332_f233ef">Position:</label>
    <div>Software Engineer</div>
</div>

"Software Engineer" is a value we need to extract.

Looking at the CSS class, it's pretty apparent that the name is generated randomly by the frontend build pipeline, so the naive .wd321213d3332_f233ef CSS selector is too fragile and will break on the next website deployment.

Leveraging the :contains selector supported by Cheerio is a great approach for cases when a website has completely dynamic CSS classes on its elements. Here the label is found by its text, and the + combinator then picks the div immediately following it:

$('label:contains("Position:") + div').text()

Selecting by data attributes

Sample HTML:

<div>
    <label class="wd321213d3332_f233ef" data-test-id="position">Position:</label>
    <div>Software Engineer</div>
</div>

Attributes such as data-test-id or data-another-property occur often in HTML, and they make good anchors for Cheerio selectors, especially when class names are dynamic. Here is a good Cheerio selector for this HTML:

$('label[data-test-id="position"] + div').text()

Experiment with these 3 methods for dynamic CSS classes in the Cheerio Sandbox.

How to find elements without specific attributes in Cheerio?

Sample HTML:

 <div class="content">This div has a class attribute</div>
 <div>This div does not have a class attribute</div>
 <div class="footer">This div also has a class attribute</div>

Let's find all div elements without a class attribute using the :not pseudo-class and the attribute selector:

const divsWithoutClass = $('div:not([class])');

Writing Cheerio selectors like a Pro

Getting comfortable with all the methods of accessing data through Cheerio is an important and useful skill, especially in web scraping.

Here is a sample HTML:

<html>
  <body>
    <table>
      <tr>
        <td>First row title</td>
        <td>First row value</td>
      </tr>
      <tr>
        <td>Second row title</td>
        <td>Second row value</td>
      </tr>
    </table>
  </body>
</html>

Let's imagine we need to access the "Second row value" string here, and we know that the table always has exactly 2 rows. How can we do that?

Here are 3 ways to do it!

1. Access element by index, using the jQuery-style :eq() selector

$('table tr:eq(1) td:eq(1)').text().trim()

2. Access element by functional traversing, using next()

$('table').find('td').parent().next().find('td').eq(1).text()

3. Access element by text value of header

$('td:contains("Second row title") + td').text()

If you know how to use all 3, you are ready for some real life Cheerio projects! Do you know another method? Let me know and I will be happy to add it!

Extracting JSON-LD (JSON for Linked Data) using Cheerio

JSON-LD (JSON for Linked Data) is essentially a standardized JSON object embedded into some web pages inside a script tag with type application/ld+json. It is very useful because it is pure machine-readable data in JSON format: easy to extract and analyze. Read more about JSON-LD on its website.

let ldNode = $("script[type='application/ld+json']:first");
if (ldNode[0] && ldNode[0].children && ldNode[0].children[0]) {
  // the script body is stored as a single text node
  const graph = JSON.parse(ldNode[0].children[0].data);
  console.log(graph['@type']); // access LD JSON properties
}

Switching between two parsing modes

Cheerio has 2 parser engines under the hood: parse5 (the default, which follows the HTML spec) and htmlparser2 (more forgiving, and also able to parse XML).

Usually you don't need to worry about which parser to use, until you start getting weird errors when extracting data. In my case, I hit an issue with special characters in JSON-LD: as far as I remember, an & (ampersand) inside a JSON string broke things for Cheerio, so it could not extract the full JSON object. Switching to htmlparser2 made my extractor around 50-70% slower (though still sub-200ms), but it fixed the error:

const $ = cheerio.load(html, { xml: true });

So, passing xml: true during Cheerio initialization switches the parser to htmlparser2 in XML mode, which was slower in my case.

Why Cheerio in 2024? Puppeteer already won the scrapers battle, didn't it?

If you use Puppeteer to render websites, there is still room for Cheerio! For example, in the ScrapeNinja project (which provides tooling for fast and effective website scraping) I prefer to separate the rendering process from the extraction process, because Puppeteer code quickly gets messy and complex. This comes with a small performance hit, since the DOM tree of the website is processed twice: once when Puppeteer renders the page, and again when the resulting HTML is passed to Cheerio to build the DOM structure.

So, in ScrapeNinja, I can use Puppeteer to open the website and interact with it, and then I write and pass a so-called extractor function, which uses Cheerio to extract all the data from the HTML. This works pretty well so far!


Bonus: interactive CSS selectors puzzles

Puzzle #1: extracting data from tables with generated CSS classes

https://scrapeninja.net/scraper-sandbox?slug=task-random-css-class

Solution: https://scrapeninja.net/scraper-sandbox?slug=task-random-css-class-solution
