Cheerio is the de facto standard for parsing HTML in server-side JavaScript (Node.js). It is a fast, flexible, and lean implementation of jQuery-like syntax designed specifically for the server.

GitHub: https://github.com/cheeriojs/cheerio (25.8K stars)

NPM: https://www.npmjs.com/package/cheerio

Cheerio is a performant way to extract data from raw HTML pages, and it is a great fit for web scraping tasks when you don't need real browser rendering or simply don't want to use Puppeteer / Playwright.

I have already built a number of JavaScript web scrapers that parse HTML and extract JSON from it, but for me, writing and testing Cheerio code still takes a significant amount of time. Cheerio's syntax is similar to jQuery, but it still runs in Node.js: there is no "real" DOM and no real jQuery.

I suffered every time I googled for "Cheerio quick examples" and "Cheerio how to iterate over element children". Not anymore.

Cheerio Online Playground

I have built an online Cheerio sandbox to quickly test Cheerio syntax against various test inputs. Think of it as regex101, but for Cheerio selectors instead of regex strings. It is already insanely helpful for me: I think it saves me roughly 15-30 minutes on every simple scraper I write, just because I have working selector samples at hand and can quickly test new selectors.

Try the sandbox here.

Cheerio Cheatsheet

Init Cheerio DOM from an HTML string

const cheerio = require('cheerio');

// build a Cheerio DOM from an HTML string
const $ = cheerio.load('<html><body><h1 id="title">Sample title</h1></body></html>');
// extract title
console.log($('#title').text());

In all code below $ is a cheerio DOM object.

Replace multiple spaces with a single space

This is useful for cleaning up HTML descriptions scraped from websites.

$('.descr').html().replace(/\s\s+/g, ' ').trim()
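One thing to watch with the one-liner above: .html() returns null when the selector matches nothing, so calling .replace() on it throws. Below is a minimal sketch of a null-safe helper. The name cleanText is my own invention for illustration, not part of Cheerio, and it uses /\s+/g instead of /\s\s+/g, so a single newline or tab is also turned into a space.

```javascript
// Hypothetical helper (not part of Cheerio): collapse whitespace runs safely.
function cleanText(raw) {
    // .html() gives null for an empty selection, so guard before calling .replace()
    if (raw == null) return '';
    // any run of whitespace (spaces, tabs, newlines) becomes one space
    return raw.replace(/\s+/g, ' ').trim();
}

console.log(cleanText('  Fast,\n\t lean   parser  ')); // "Fast, lean parser"
console.log(cleanText(null)); // ""
```

With a helper like this, the extraction becomes cleanText($('.descr').html()) and a missing element gives an empty string instead of an exception.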

Iterate over children and return them as array of objects

#1: functional approach with map():

let parsedItems = $('.first-list li').map(function() {
    return {
        descr: $('.descr', this).html().replace(/\s\s+/g, ' ').trim(),
        price: $('.price', this).text().replace('$', '')
    };
}).toArray();

console.log(parsedItems);

Important: there are two ways to pass a callback when iterating over children in Cheerio, and both work identically in map() and each(). First, you can pass a regular function() and access the current element via this. Second, you can pass an arrow function; since arrow functions do not bind their own this, you need the second argument to access the current element: (idx, elem) => { }. Whether you use arrow functions and remember that the element is the second argument, or use this inside function(), is a matter of personal preference.

Also important: do not forget .toArray()! Without it, map() returns a Cheerio object rather than a plain array.

#2: imperative approach with each():

let items = $('.first-list li');

let parsedItems = [];
items.each((i, elem) => {
	parsedItems.push($(elem).text());
});

console.log(parsedItems);

Select by partial matching

This is useful in case of dynamic classes generated by frontend frameworks, like some_class_xjeklj3:

$('[class^=some_class_]'); // select elements whose class attribute starts with "some_class_"

Why Cheerio in 2022? Puppeteer already won the scrapers battle, didn't it?

If you use Puppeteer to render websites, there is still room for Cheerio! For example, in the ScrapeNinja project (which provides tooling for fast and effective website scraping) I prefer to separate the rendering process from the extraction process, because Puppeteer code quickly gets messy and complex. This comes with a small performance hit, since the website's DOM tree is processed twice: once when Puppeteer renders the page, and again when all the HTML is passed to Cheerio to rebuild the DOM structure.

So, in ScrapeNinja, I can use Puppeteer to open the website and interact with it, and then I write and pass a so-called extractor function, which uses Cheerio to extract all the data from the HTML. This works pretty well so far!

Short tutorial where ScrapeNinja Cheerio selectors are used for web scraping, in ScrapeNinja Live Sandbox:

Bonus: interactive CSS selector puzzles

Puzzle #1: extracting data from tables with generated CSS classes

https://scrapeninja.net/scraper-sandbox?slug=task-random-css-class

Solution: https://scrapeninja.net/scraper-sandbox?slug=task-random-css-class-solution