In case you want to import some random website data into Google Sheets, the obvious way to start this exciting adventure is to use
importXML() function. The main advantage is that this function is available in Google Sheets out of the box!
The syntax for the IMPORTXML function is:
- url refers to the URL of the website. You can represent this value in the formula by including the protocol “http://” or “https://” and URL between double quotation marks or a reference to a cell containing the corresponding text.
- xpath_query is the parameter in charge of telling the function what data to import. For example, if you define this parameter as “//h1/@title”, you’ll get all the attributes from <h1> tags on the website.
If you haven't tried it, create a new Google Sheet and paste this into some random cell, and hopefully in a second or two you will see some data from Wikipedia appearing in your sheet.
ImportXML for web scraping?
For a lot of cases, ImportXML function is immensely useful. Take a look at StackOverflow quesions and answers on this topic, people are doing useful and non-obvious stuff there! The major problems with ImportXML are:
- Custom processing and conversions of data is painful in Google Sheets. Let's say you want to import just
- There is no proper launch and cache control in ImportXML. There is no "scrape now" button in your Google Sheet. You can't control how often the
ImportXMLis triggered and if you have many cells populated with
importXML, it is very easy to occasionally trigger an avalanche of external http calls when opening and editing your sheet. There are no caching mechanisms in Google Sheets, so if the last call to ImportXML fails, the cell will get
ImportXMLonly works with basic websites (no SPAs rendered in browsers can be scraped this way, any basic web scraping protection or connectivity issue breaks the process, no control over HTTP request geo location, or number of retries)
Alternative #1: Google App Script
Thankfully, some real programming is possible in Google Sheets! It is done using Google Apps Script
There are certain pre-made Google Apps Scripts which help to extract data, the most famous is custom ImportJSON script (currently unmaintained). If you need to work with HTML and extract some nodes there (think: use jQuery selectors), there is also the cheerio for Google Apps Script.
The basic idea is that you copy&paste the code of custom script into Google Sheets script (or, include the external script as a library) and then use some new formula (`ImportJSON`) in your cells. The good thing about Google Apps Script is its flexibility and triggers: so, for example, you can setup a script to be triggered every hour. You can even add custom menu items to Google Sheets menu with it! This approach still suffers from some of the issues of
- Unfortunately, it is not too easy to develop and debug Apps Script in Google App Script IDE.
Alternative #2: Automation frameworks + ScrapeNinja
Here is the video of using Pipedream and ScrapeNinja.net to extract HackerNews titles to Google Sheets every hour:
Important thing to understand is that ScrapeNinja is an API-first solution which can also be used without Pipedream, just with Google Apps Script (since Google Apps Script can execute API requests). This might be a nice hybrid approach to tackle the problem.