webscraping

Web scraping is a topic in which I am deeply interested. I have completed numerous successful web scraping projects and have launched several products in this field. Most of these projects were developed using Node.js, Puppeteer, cURL, and Playwright.

Building n8n web crawler for RAG

This week, I’m introducing a new project at ScrapeNinja: a recursive web crawler, packed into an n8n community node. It isn’t just another scraper - it’s a powerful open-source tool that runs in your local n8n instance and can be used to harvest huge amounts of data. For example, I use it to consolidate technical documentation (many web pages) into a clean Markdown file that I can feed into a large language model (LLM) for retrieval augmented generation (RAG) and other advanced use

14 min read

Web scraping in n8n

I am a big fan of n8n and use it for a lot of my projects. I love that it provides a self-hosted version, and that this version is not paywalled, as often happens with so-called "open core" products which use "open source" merely as a marketing term. Web scraping in n8n can be both simple and sophisticated, depending on your approach and tools. In this blog post, I will explore two ways of scraping: basic HTTP requests and advanced scraping techniques using ScrapeNinja n8n int

9 min read

How to set proxy in Python Requests

Introduction As a seasoned developer with a keen interest in web scraping and data extraction, I've often leveraged Python for its simplicity and power. In this realm, understanding and utilizing proxies becomes a necessity, especially to navigate through the complexities of web requests, IP bans, and rate limiting. In this article, I'll share my insights and experiences on using proxies with Python's Requests library. We'll start from the basics and gradually move to more advanced techniques l

4 min read

How to set proxy in Playwright

In this article I will describe how to set a proxy in Playwright (the Node.js version of Playwright). Playwright is obviously one of the best and most modern solutions to automate browsers in 2024. With Chromium it uses the CDP protocol to send commands to the browser, and it supports Chromium, Firefox and WebKit out of the box. It is open source and very well maintained. Its main use cases are UI test automation and web scraping. Setting up proxies is useful for both of these use cases - especially for web scr

4 min read

Modern web scraping with Playwright: choosing between Python and NodeJS

When diving into the world of automated browser testing and scraping with Playwright, one of the first decisions you'll encounter is the choice of programming language. Playwright is not a one-language wonder; it caters to a polyglot audience. Let's see how the Node.js and Python versions of Playwright compare. A bit of history: Playwright was created by one of the authors of Puppeteer.js, Andrey Lushnikov (who was part of the Chrome DevTools team back then). Playwright was built on the les

4 min read