webscraping

Web scraping is a topic in which I am deeply interested. I have completed numerous successful web scraping projects and have launched several products in this field. Most of these projects were developed using Node.js, Puppeteer, cURL, and Playwright.

Building n8n web crawler for RAG

This week, I’m introducing a new project at ScrapeNinja: a recursive web crawler, packed into an n8n community node. It isn’t just another scraper - it’s an advanced, powerful open-source tool that executes in your local n8n instance and can be used to harvest huge amounts of

Web scraping in n8n

I am a big fan of n8n and I am using it for a lot of my projects. I love that it provides a self-hosted version and this self-hosted version is not paywalled like if often happens with so-called "open core" products which just use "open source" as a marketing

How to set proxy in Python Requests

Introduction As a seasoned developer with a keen interest in web scraping and data extraction, I've often leveraged Python for its simplicity and power. In this realm, understanding and utilizing proxies becomes a necessity, especially to navigate through the complexities of web requests, IP bans, and rate limiting. In this

How to set proxy in Playwright

In this article I will describe how to set a proxy in Playwright (Node.js version of Playwright). Playwright is obviously one of the best and most modern solutions to automate browsers in 2024. It uses the CDP protocol to send commands to browsers and supports Chromium, Chrome and Firefox

Modern web scraping with Playwright: choosing between Python and NodeJS

When diving into the world of automated browser testing and scraping with Playwright, one of the first decisions you'll encounter is the choice of programming language. Playwright is not a one-language wonder; it caters to a polyglot audience. Let's see how Node.js and Python version of Playwright compare. A

Blocking images in Playwright

Blocking unnecessary resources in Playwright is a pretty easy task, thanks to builtin route() function.