For me, writing and testing cheerio code is always a pain unless I did it just last week. Cheerio's syntax is similar to jQuery's, but this is still Node.js, not a "real" browser DOM.

I used to suffer every time I googled "Cheerio quick examples". Not anymore.

Cheerio Sandbox

I have built a cheerio sandbox to quickly test cheerio syntax against various test inputs. It is already insanely helpful for me: I estimate it saves me 15-30 minutes on every simple scraper I write, just because I have working selector samples at hand and can quickly test new selectors.

Try the sandbox here.

Cheerio Cheatsheet

Init cheerio DOM from html string

const cheerio = require('cheerio');

const $ = cheerio.load('<html><body><h1 id="title">Sample title</h1></body></html>');
// extract the title text
console.log($('#title').text());

In all the code below, $ is a cheerio DOM object created with cheerio.load().

Replace multiple spaces with a single space

This is useful for cleaning up HTML descriptions scraped from websites.

$('.descr').html().replace(/\s\s+/g, ' ').trim()
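One caveat with the one-liner above: .html() returns null when the selector matches nothing, so the chained .replace() throws. Here is a minimal sketch of a guard. The name collapseWhitespace is a hypothetical helper for illustration, not part of cheerio.

```javascript
// Hypothetical helper (not part of cheerio): collapses runs of two or more
// whitespace characters into one space, and tolerates the null that
// .html() returns when the selector matched nothing.
function collapseWhitespace(html) {
  if (html == null) return '';
  return html.replace(/\s\s+/g, ' ').trim();
}

console.log(collapseWhitespace('  <b>Big</b>\n\n   red   ball  ')); // <b>Big</b> red ball
console.log(collapseWhitespace(null)); // prints an empty line
```

With a loaded document, the call would look like collapseWhitespace($('.descr').html()).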

Iterate over children and return them as array of objects

#1: functional approach with map():

let parsedItems = $('.first-list li').map(function() {
    return {
        descr: $('.descr', this).html().replace(/\s\s+/g, ' ').trim(),
        price: $('.price', this).text().replace('$', '')
    };
}).toArray();

console.log(parsedItems);

Important: there are two ways to pass a callback when iterating over children in cheerio, and they work identically for map() and each(). First, you can pass a function() and access the current element as this. Second, you can pass an arrow function, in which case you need the second argument to access the current element: (idx, elem) => { }. It is a matter of personal preference: either use arrow functions and remember that the element is the second argument, or use this inside function().

Also important: do not forget .toArray()! Without it, map() gives you a cheerio object, not a plain array.

#2: imperative approach with each():

let items = $('.first-list li');

let parsedItems = [];
items.each((i, elem) => {
	parsedItems.push($(elem).text());
});

console.log(parsedItems);

Select by partial matching

This is useful in case of dynamic classes generated by frontend frameworks, like some_class_xjeklj3:

$('[class^="some_class_"]'); // select elements whose class attribute starts with "some_class_"

Why Cheerio in 2022? Puppeteer already won the scrapers battle, didn't it?

If you use Puppeteer for rendering websites, there is still room for Cheerio! For example, in the ScrapeNinja project (which provides tooling for fast and effective website scraping) I prefer to separate the rendering process from the extraction process, because Puppeteer code quickly gets messy and complex. This comes with a small performance hit, since the DOM tree of the website is processed twice: once when Puppeteer renders the website, and again when all the HTML is passed to Cheerio to build the DOM structure.

So, in ScrapeNinja, I can use Puppeteer to open the website and interact with it, and then I write and pass a so-called extractor function, which uses Cheerio to extract all the data from the HTML. This works pretty well so far!