[DISCONTINUED] Scraping Instagram in 2021: avoiding 302 and 429 errors
Disclaimer: I don't do Instagram scraping anymore, and I don't think any of the approaches mentioned below work reliably now.
Instagram is a tough target for scraping.
For one of my side projects, I needed to get information from several public accounts on a daily basis, for example their follower counts and their recent posts. I tried the most popular GitHub scrapers, such as https://github.com/realsirjoe/instagram-scraper and https://github.com/postaddictme/instagram-php-scraper on a DigitalOcean droplet. It quickly turned out that Instagram either redirects to the /login location or responds with "429 The maximum number of requests per hour has been exceeded", even though it was my first request to its GraphQL endpoint. Apparently, Instagram has banned all datacenter IP ranges. Issues about 302 and 429 errors are opened in GitHub issue queues almost every day, so I definitely was not alone.
I did not want to log in to a fake Instagram account, because scraping via an account would probably violate the Instagram Terms and is not the most ethical thing to do. I also neither wanted nor needed to do shady things like mass following; public account information was all I was interested in.
So I purchased several proxies from Luminati (which was later rebranded as Bright Data) and from other providers like smartproxy, with bad results: even their expensive "residential" proxies were banned by Instagram pretty much like my own datacenter IPs, so I was again getting a 302 redirect on 80-90% of hits. I implemented retries on 302 redirects and timeouts, but this resulted in really poor response times, around 15-25 seconds on average.
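To make the "retries on 302 redirect and timeouts" idea concrete, here is a minimal Python sketch, not the code I actually ran. It assumes that a 302 or 429 means "blocked, try again", that a timeout is also worth retrying, and it adds exponential backoff between attempts, which is my own choice here. The names `make_fetch` and `fetch_with_retries` are made up for illustration. `http.client` is used because it does not follow redirects on its own, so the 302 stays visible to the caller.

```python
import http.client
import socket
import time
from typing import Callable, Optional, Tuple

# Statuses treated as "blocked, worth retrying" (assumption based on the errors above).
BLOCKED_STATUSES = {302, 429}


def make_fetch(host: str, path: str, timeout: float = 10.0) -> Callable[[], Tuple[int, str]]:
    """Build a zero-argument function that performs one GET without following redirects."""
    def fetch() -> Tuple[int, str]:
        conn = http.client.HTTPSConnection(host, timeout=timeout)
        try:
            conn.request("GET", path)
            resp = conn.getresponse()
            return resp.status, resp.read().decode("utf-8", errors="replace")
        finally:
            conn.close()
    return fetch


def fetch_with_retries(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it returns 200; give up with None after max_attempts."""
    for attempt in range(max_attempts):
        try:
            status, body = fetch()
        except (TimeoutError, socket.timeout):
            status, body = None, ""  # treat a timeout like a blocked response
        if status == 200:
            return body
        if status is not None and status not in BLOCKED_STATUSES:
            # Something other than a block: retrying is unlikely to help.
            raise RuntimeError(f"unexpected status {status}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None
```

With an 80-90% failure rate, most calls burn through several attempts plus the waits between them, which is roughly how you end up with response times in the tens of seconds.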
It turned out that there is a solution to the problem: the unofficial Instagram API https://rapidapi.com/restyler/api/instagram40 which uses residential proxy networks and smart retries to bypass Instagram restrictions. It helped me build my project, and it has kept working well for more than 3 months, so it looks pretty stable to me. Around 3-4% of requests end with 5xx errors, but that is an explicit error that is instantly visible to my software, so I can just retry the failed requests once in a while. Considering Instagram's strict policy, and compared to the other solutions I tried, that is good enough for me. A proxified PHP scraper that uses this RapidAPI provider under the hood is available on GitHub: https://github.com/restyler/instagram-php-scraper (it is a fork of postaddictme/instagram-php-scraper, which was mentioned above).
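Retrying the failed requests "once in a while" can be sketched as a few passes over the list of accounts, where only the ones that returned a 5xx are carried over to the next pass. This is an illustration under assumptions, not my production code: `run_with_retry_passes` and the `call_api` callback (which returns a status code and a payload) are hypothetical names.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def run_with_retry_passes(
    usernames: Iterable[str],
    call_api: Callable[[str], Tuple[int, Any]],
    max_passes: int = 3,
) -> Tuple[Dict[str, Tuple[int, Any]], List[str]]:
    """Query every username; re-query only those that failed with a 5xx status.

    Returns (results, still_failing): results maps username to (status, payload),
    still_failing lists usernames that got a 5xx on every pass.
    """
    results: Dict[str, Tuple[int, Any]] = {}
    pending = list(usernames)
    for _ in range(max_passes):
        failed = []
        for name in pending:
            status, payload = call_api(name)
            if 500 <= status < 600:
                failed.append(name)  # explicit server error: try again in a later pass
            else:
                results[name] = (status, payload)
        pending = failed
        if not pending:
            break
    return results, pending
```

In a daily job the passes would normally be spaced out in time rather than run back to back; anything left in the second return value can simply wait for the next run.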
How to scrape Instagram in 2021: step by step
- Sign up on RapidAPI. RapidAPI is a big marketplace where developers publish their APIs, and I am really excited about this platform, since it embraces a divide-and-conquer approach: it lets app developers focus on what their end customers need and delegate part of the work to other developers' solutions. The best part about RapidAPI is that its API explorer lets you subscribe to and test several APIs to see how they perform in real time, and quickly decide whether a specific API is good enough for your use case. This is especially easy for APIs that provide free plans.
- Subscribe to a specific API on the RapidAPI marketplace. For Instagram, I recommend https://rapidapi.com/restyler/api/instagram40
- Use the API. For this Instagram API, a ready-made PHP solution is available on GitHub (https://github.com/restyler/instagram-php-scraper), but of course you can also just call the API from your own code.
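If you prefer to call the API from your own code, the last step can be sketched in Python roughly like this. RapidAPI authenticates requests with the `X-RapidAPI-Key` and `X-RapidAPI-Host` headers; everything else here is an assumption for illustration: the host `instagram40.p.rapidapi.com`, the `/account-info` path, the `username` parameter, and the `RAPIDAPI_KEY` environment variable. Check the API page on RapidAPI for the real endpoint names before using it.

```python
import json
import os
import urllib.parse
import urllib.request

# Assumed values: copy the real host and endpoint from the API's RapidAPI page.
API_HOST = "instagram40.p.rapidapi.com"
ACCOUNT_INFO_PATH = "/account-info"  # hypothetical endpoint name


def build_request(username: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for one public account."""
    query = urllib.parse.urlencode({"username": username})
    url = f"https://{API_HOST}{ACCOUNT_INFO_PATH}?{query}"
    return urllib.request.Request(
        url,
        headers={
            "X-RapidAPI-Key": api_key,   # your personal key from the RapidAPI dashboard
            "X-RapidAPI-Host": API_HOST,
        },
    )


def fetch_account_info(username: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body; 4xx/5xx raise urllib.error.HTTPError."""
    request = build_request(username, os.environ["RAPIDAPI_KEY"])
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping `build_request` separate from the network call makes it easy to check the URL and headers without spending API quota.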
Cheers!
UPD July 2021: The Instagram API still works fine for my project. I've tried several other API solutions but went back to the instagram40 API because of its stability: other RapidAPI Instagram providers just couldn't achieve good stability and latency, and their downtimes could last for days, with zero response from the vendor.
UPD August 2021: I've recorded a simple tutorial on how to scrape Instagram followers and enrich them with profile data:
The repo with the code used in this how-to guide:
https://github.com/restyler/ig_scraper