Web scraping in Google Sheets: ImportXML & alternatives
If you want to import data from some random website into Google Sheets, the obvious way to start this exciting adventure is the importXML()
function. Its main advantage is that this function is available in Google Sheets out of the box!
IMPORTXML Syntax
The syntax for the IMPORTXML function is:
=IMPORTXML(url, xpath_query)
- url is the URL of the website. You can pass this value into the formula either as text between double quotation marks, including the “http://” or “https://” protocol, or as a reference to a cell containing that text.
- xpath_query is the parameter that tells the function which data to import. For example, if you set this parameter to “//h1/@title”, you’ll get the title attributes of all <h1> tags on the website.
Working example:
=IMPORTXML("https://en.wikipedia.org/wiki/Moon_landing", "//a/@href")
If you haven't tried it yet, create a new Google Sheet, paste this formula into some cell, and in a second or two you should see data from Wikipedia appear in your sheet.
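Since the url argument can also be a cell reference (as noted above), the same query can be written like this, assuming cell A1 contains the Wikipedia URL:
=IMPORTXML(A1, "//a/@href")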
ImportXML for web scraping?
For a lot of cases, the ImportXML function is immensely useful. Take a look at the StackOverflow questions and answers on this topic: people are doing useful and non-obvious stuff there! The major problems with ImportXML are:
- Custom processing and conversion of data is painful in Google Sheets. Let's say you want to import just the number 32.22 from the string "cost: 32.22 USD" extracted from some website. It is possible to do in Google Sheets, with some fiddling around formulas and scratching your head, but it is obviously much easier to accomplish this task in JavaScript or Python, especially if you have lots and lots of data to clean up (see the sketch after this list).
- There is no proper launch and cache control in ImportXML. There is no "scrape now" button in your Google Sheet. You can't control how often ImportXML is triggered, and if you have many cells populated with ImportXML, it is very easy to accidentally trigger an avalanche of external HTTP calls when opening and editing your sheet. There are no caching mechanisms in Google Sheets, so if the last call to ImportXML fails, the cell will get an ERR! value.
- ImportXML only works with basic websites: no SPAs rendered in the browser can be scraped this way, any basic web scraping protection or connectivity issue breaks the process, and there is no control over the HTTP request geo location or the number of retries.
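To make the cleanup point concrete, here is a minimal sketch of extracting that 32.22. In plain Google Sheets you can get away with REGEXEXTRACT; the equivalent Apps Script custom function (callable from a cell) is a one-liner. The function name below is made up for illustration, and the regex assumes a simple decimal number:

```javascript
// In a Google Sheets cell (assuming the raw string sits in A1):
//   =REGEXEXTRACT(A1, "[0-9]+\.[0-9]+")
// The same extraction as a custom Apps Script function, usable as =EXTRACT_PRICE(A1):
function EXTRACT_PRICE(text) {
  const match = String(text).match(/\d+(\.\d+)?/); // first decimal number in the string
  return match ? Number(match[0]) : null;
}
```

The formula version is fine for one column; once you need multi-step cleanup across many columns, the script version scales much better.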
Alternative #1: Google Apps Script
Thankfully, some real programming is possible in Google Sheets! It is done using Google Apps Script.
There are certain pre-made Google Apps Scripts that help extract data; the most famous is the custom ImportJSON script (currently unmaintained). If you need to work with HTML and extract nodes from it (think: jQuery-style selectors), there is also cheerio for Google Apps Script.
The basic idea is that you copy and paste the code of the custom script into the Apps Script editor attached to your sheet (or include the external script as a library) and then use a new formula (`ImportJSON`) in your cells. The good thing about Google Apps Script is its flexibility and triggers: for example, you can set up a script to be triggered every hour. You can even add custom menu items to the Google Sheets menu with it! This approach still suffers from some of the issues of ImportXML, unless you write higher-level JavaScript functions in the Google Apps Script IDE (see the sketch after the list below):
- Unfortunately, it is not too easy to develop and debug Apps Script in the Google Apps Script IDE.
- Complex websites (SPAs with heavy frontend JavaScript, websites protected against web scraping, websites that can be viewed only from specific countries) are still out of reach for scraping.
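As a minimal sketch of what a "higher-level function" could look like: a custom function that fetches a page and extracts its title, plus an hourly trigger. UrlFetchApp, SpreadsheetApp, and ScriptApp are real built-in Apps Script services; the function names, target URL, and regex are made up for illustration, and the trigger part assumes a script bound to your spreadsheet:

```javascript
// Custom function, callable from a cell as =IMPORT_TITLE("https://example.com")
function IMPORT_TITLE(url) {
  const html = UrlFetchApp.fetch(url).getContentText();
  const match = html.match(/<title>([^<]*)<\/title>/i); // naive <title> extraction, fine for basic pages
  return match ? match[1].trim() : 'no title found';
}

// Sketch of "scrape every hour": writes the result into cell A1 of the active sheet.
function refreshData() {
  const sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  sheet.getRange('A1').setValue(IMPORT_TITLE('https://example.com')); // hypothetical target URL
}

// Run this once from the Apps Script IDE to register the hourly trigger.
function installHourlyTrigger() {
  ScriptApp.newTrigger('refreshData').timeBased().everyHours(1).create();
}
```

Note how this sidesteps the ImportXML cache problem: the trigger gives you full control over when the HTTP call actually happens.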
Alternative #2: Automation frameworks + ScrapeNinja
Another viable approach to web scraping into Google Sheets is to use Make.com, Zapier, or their developer-focused alternative Pipedream.com. This approach is pretty flexible, reliable AND low-code. It requires two external web services, and Google Sheets is basically used as a data store only: all the logic is done in Pipedream, a low-code automation platform for developers, and all the web scraping heavy lifting (data retrieval, extracting data from HTML, data post-processing) is done in the ScrapeNinja.net low-code sandbox. The best part of Pipedream is that this approach is infinitely more flexible and reliable than plain Google Sheets, as Pipedream is basically a direct Zapier competitor (but with a better free plan, and more low-code than no-code, so some JavaScript knowledge might be required).
Here is a video of using Pipedream and ScrapeNinja.net to extract HackerNews titles to Google Sheets every hour:
Another one, with Make.com:
An important thing to understand is that ScrapeNinja is an API-first solution, which can also be used without Pipedream, with just Google Apps Script (since Google Apps Script can execute API requests). This might be a nice hybrid approach to tackle the problem, sketched below.
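Here is a minimal sketch of that hybrid approach: calling the ScrapeNinja API from Apps Script via UrlFetchApp. The endpoint, headers, and payload shape are assumptions based on ScrapeNinja's RapidAPI listing, so verify them against the current docs; the API key is a placeholder:

```javascript
// Sketch: fetch a page through ScrapeNinja from Google Apps Script.
// Endpoint and payload shape are assumptions; check ScrapeNinja's current docs.
function scrapeWithNinja(targetUrl) {
  const response = UrlFetchApp.fetch('https://scrapeninja.p.rapidapi.com/scrape', {
    method: 'post',
    contentType: 'application/json',
    headers: {
      'X-RapidAPI-Key': 'YOUR_RAPIDAPI_KEY', // placeholder, put your own key here
      'X-RapidAPI-Host': 'scrapeninja.p.rapidapi.com',
    },
    payload: JSON.stringify({ url: targetUrl }),
  });
  const result = JSON.parse(response.getContentText());
  return result.body; // raw HTML of the scraped page, per the assumed response shape
}
```

Compared to plain UrlFetchApp, this buys you the things ImportXML can't do: JS rendering, retries, and geo control, while keeping all your glue code inside Google Sheets.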