Building n8n web crawler for RAG

This week, I’m introducing a new project at ScrapeNinja: a recursive web crawler, packed into an n8n community node. It isn’t just another scraper: it’s a powerful open-source tool that runs in your local n8n instance and can harvest large amounts of data. For example, I use it to consolidate technical documentation (many web pages) into a clean Markdown file that I can feed into a large language model (LLM) for retrieval-augmented generation (RAG) and other advanced use cases.

Proceed to the code and installation instructions: https://github.com/restyler/n8n-nodes-scrapeninja

Video demo of ScrapeNinja n8n web crawler

Over the past few months, I’ve tested different approaches for cleaning raw HTML and creating efficient scraping pipelines using n8n. Since I have some experience building complex web scrapers in code, I quickly realized that the n8n ecosystem could be significantly improved with better web scraping tools. As a result, I built the ScrapeNinja n8n community node.
Read my blog post about web scraping in n8n for the broader context; in this post, I will talk about a new feature of the ScrapeNinja node: the web crawler.

The Need for Specialized LLM Knowledge Bases

Large language models have changed how we interact with data, but they’re often missing domain-specific insights. By structuring documentation into a single Markdown file, we give LLMs the context they need to produce accurate, detailed answers.

I’ve observed that using a curated knowledge base fed into an LLM as a prompt (I am so glad we now get 128k+ context windows!) dramatically improves response quality and code generation. This realization led me to build a tool that automates the process of building knowledge bases from websites in a scalable, repeatable way.
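To make that concrete, here is a minimal sketch of what "feeding a knowledge base into an LLM as a prompt" can look like in a small Node.js/TypeScript script. It assumes the crawler has already produced a consolidated docs.md file and that you are talking to an OpenAI-compatible chat endpoint; the file name, model name, and environment variable are placeholders, not part of the ScrapeNinja node itself.

// Sketch: load the crawler-produced Markdown and use it as LLM context.
import { readFile } from "node:fs/promises";

async function askWithKnowledgeBase(question: string): Promise<string> {
  // docs.md is the single Markdown file produced by the crawler run.
  const knowledgeBase = await readFile("docs.md", "utf8");

  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o", // any model with a large context window works
      messages: [
        // The whole knowledge base goes into the system prompt.
        { role: "system", content: "Answer using only the documentation below.\n\n" + knowledgeBase },
        { role: "user", content: question },
      ],
    }),
  });

  const data = await response.json();
  return data.choices[0].message.content;
}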


Why n8n?

I love n8n: it is an open-source, self-hosted workflow engine where complex automations can be built with low code. It also has an amazing community.

Web Scraping Challenges

There are plenty of web scraping tutorials for n8n, but real-world web scraping gets much harder once you try to do something useful. Common issues include:

• Messy HTML Output: Raw HTML often includes scripts, styles, and irrelevant tags that confuse both humans and LLMs (see the cleanup sketch after this list).

• Manual URL Management: Listing URLs by hand quickly becomes tedious, especially as documentation grows.

• Resource-Heavy Operations: Spinning up a new browser session for every page is slow and uses too many resources.
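To give a feel for the first problem, here is a tiny illustration (not the node's internal code) of the kind of cleanup raw HTML needs before it is useful to an LLM, using the cheerio library:

// Strip tags that carry no readable content before passing text to an LLM.
import * as cheerio from "cheerio";

export function stripNoise(html: string): string {
  const $ = cheerio.load(html);
  // Scripts, styles and embedded widgets only add noise (and tokens).
  $("script, style, noscript, iframe, svg").remove();
  // Collapse whitespace so the remaining text is compact.
  return $("body").text().replace(/\s+/g, " ").trim();
}

The "Extract primary content" operation in the ScrapeNinja node goes further than this (it isolates the main article content and yields Markdown), but the sketch shows why raw HTML is rarely usable as-is.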

I built the ScrapeNinja web scraping API, so I know how painful it is to maintain a web scraper.
With 10+ years of experience as a software developer, most of my web scrapers are written in code (Node.js), but I have recently started using n8n more and more for my own tasks, since it is sometimes easier to maintain scrapers in n8n than to maintain a Node.js application. When something breaks, checking the n8n execution logs is so much better than realizing I don't even remember where a particular Node.js process is running (I literally have 10+ cloud servers). n8n also lets you use JS code anywhere, which is a very nice feature compared to other no-code platforms that are "too no-code" for me; I don't like to feel restricted. n8n is also self-hosted and runs perfectly well via a docker compose file. Just try it.
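As an example of that flexibility, this is the kind of snippet you can drop into an n8n Code node ("Run Once for All Items") right after the crawler; the Code node executes JavaScript, and the field names here are assumptions for illustration, not the node's exact output schema:

// n8n Code node: keep only the fields we care about from each crawler item.
return $input.all().map((item) => ({
  json: {
    url: item.json.url,           // assumed field name
    title: item.json.pageTitle,   // assumed field name
    markdown: item.json.markdown, // assumed field name
  },
}));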

The Birth of the ScrapeNinja n8n Crawler

The ScrapeNinja crawler was another evolutionary step for me: I had already shipped the ScrapeNinja scraping n8n node and gained some experience building complex n8n nodes. The crawler is far more complex than the single-page Scrape operations included in the same n8n node.

GitHub repo of the ScrapeNinja n8n community node

I developed the ScrapeNinja n8n crawler to address these issues by combining workflow automation with flexible scraping capabilities. Key highlights include:

• Recursive Traversal: It follows links based on URL rules until a page limit is reached or no new URLs are found (a simplified filtering sketch appears at the end of this section).

• Hybrid Extraction Techniques: Page fetching goes through the ScrapeNinja API (/scrape for fast raw requests, /scrape-js for full browser rendering), while the "Extract primary content" operation runs locally inside the n8n node.

• Detailed Logging: All actions are logged in Postgres (via Supabase), simplifying debugging and performance checks.

The crawler node returns summary stats and MANY logs for debugging. They are also available in Postgres tables.

I even used it on ScrapeNinja’s own documentation to confirm it could produce a single Markdown file from multiple pages.
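The "URL rules" mentioned above map to the include/exclude patterns and the crawl_external flag you will see in the logs further down. A heavily simplified version of that filtering, just to make the idea concrete (the real node's matching logic may differ), could look like this:

// Decide whether a discovered link should be queued, mirroring the
// include_patterns / exclude_patterns / crawl_external fields from the logs.
function shouldQueue(
  url: string,
  seedUrl: string,
  includePatterns: string[],
  excludePatterns: string[],
  crawlExternal: boolean,
): boolean {
  const sameHost = new URL(url).hostname === new URL(seedUrl).hostname;
  if (!crawlExternal && !sameHost) return false; // external links are ignored by default
  if (excludePatterns.some((p) => url.includes(p))) return false;
  if (includePatterns.length > 0 && !includePatterns.some((p) => url.includes(p))) return false;
  return true;
}

With crawl_external set to false and empty pattern lists, this is exactly the behavior visible in the first-page link analysis log entry below: 14 same-host links included, 10 external links ignored.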


How It Works: A Technical Overview

1. Initialization: The crawler starts with a seed URL.

2. Page Fetching with ScrapeNinja API: Depending on your configuration, it calls /scrape for fast raw requests or /scrape-js for browser-rendered pages, handling JavaScript-heavy sites.

3. Local Content Extraction: Once a page is fetched, the primary content is extracted locally, removing scripts, ads, and unnecessary tags to yield clear Markdown text.

4. Recursive Link Extraction: The crawler extracts and queues links, continuing until it hits the page limit or runs out of new links (a simplified version of this loop is sketched after the list).

5. Data Storage and Logging: Processed pages and logs (in JSON) are stored in a Postgres database, aiding in visibility and troubleshooting.
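Put together, the steps above boil down to a queue-driven loop. Here is a heavily simplified sketch, with the fetch, extraction, and filtering steps injected as functions, to make the control flow concrete; it is not the node's actual implementation:

// Simplified crawl loop: breadth-first queue, capped by maxPages.
async function crawl(
  seedUrl: string,
  maxPages: number,
  fetchPage: (url: string) => Promise<string>,          // e.g. ScrapeNinja /scrape or /scrape-js
  extractPrimaryContent: (html: string) => string,      // local HTML -> Markdown step
  extractLinks: (html: string, baseUrl: string) => string[],
  shouldQueue: (url: string) => boolean,                 // URL rules from the crawler config
): Promise<string[]> {
  const queue: { url: string; depth: number }[] = [{ url: seedUrl, depth: 0 }];
  const seen = new Set<string>([seedUrl]);
  const markdownPages: string[] = [];

  while (queue.length > 0 && markdownPages.length < maxPages) {
    const { url, depth } = queue.shift()!;
    const html = await fetchPage(url);
    markdownPages.push(extractPrimaryContent(html));

    for (const link of extractLinks(html, url)) {
      if (!seen.has(link) && shouldQueue(link)) {
        seen.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return markdownPages;
}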

Things always go wrong in web scraping

Crawling is a genuinely complex process, and many things can go wrong. To mitigate this, the crawler node logs everything. I mean, everything. Take a look at these log entries:

[
  {
    "level": "debug",
    "message": "Fetching page \"https://scrapeninja.net\" using ScrapeNinja",
    "metadata": {
      "url": "https://scrapeninja.net",
      "runId": 2
    },
    "created_at": "2025-02-06T11:36:12.011Z"
  },
  {
    "level": "debug",
    "message": "Selected URL \"https://scrapeninja.net\" for processing",
    "metadata": {
      "url": "https://scrapeninja.net",
      "depth": 0,
      "runId": 2,
      "queueId": 17
    },
    "created_at": "2025-02-06T11:36:12.046Z"
  },
  {
    "level": "debug",
    "message": "Sending request to https://scrapeninja.p.rapidapi.com/scrape",
    "metadata": {
      "url": "https://scrapeninja.net",
      "runId": 2,
      "engine": "scrape",
      "marketplace": "rapidapi"
    },
    "created_at": "2025-02-06T11:36:12.072Z"
  },
  {
    "level": "info",
    "message": "First page link analysis for \"https://scrapeninja.net\"",
    "metadata": {
      "runId": 2,
      "links_ignored": 10,
      "crawl_external": false,
      "links_included": 14,
      "exclude_patterns": [],
      "include_patterns": [],
      "total_links_found": 24,
      "sample_ignored_links": [
        "https://www.producthunt.com/posts/scrapeninja?utm_source=badge-top-post-badge&utm_medium=badge&utm_souce=badge-scrapeninja",
        "https://rapidapi.com/restyler/api/scrapeninja",
        "https://rapidapi.com/restyler/api/scrapeninja/pricing",
        "https://pipedream.com/apps/pipedream/integrations/scrapeninja",
        "https://apiroad.net/proxy",
        "https://pixeljets.com/blog/bypass-cloudflare/",
        "https://pixeljets.com/blog/browser-as-api-web-scraping/",
        "https://github.com/restyler/scrapeninja-api-php-client",
        "https://www.make.com/en/integrations/scrapeninja",
        "https://t.me/scrapeninja"
      ],
      "sample_included_links": [
        "https://scrapeninja.net/",
        "https://scrapeninja.net/docs/n8n/",
        "https://scrapeninja.net/scraper-sandbox?slug=hackernews",
        "https://scrapeninja.net/docs/",
        "https://scrapeninja.net/docs/proxy-setup/",
        "https://scrapeninja.net/curl-to-scraper",
        "https://scrapeninja.net/scraper-sandbox",
        "https://scrapeninja.net/cheerio-sandbox",
        "https://scrapeninja.net/docs/make.com/",
        "https://scrapeninja.net/openapi.yaml"
      ]
    },
    "created_at": "2025-02-06T11:36:13.554Z"
  },
  {
    "level": "debug",
    "message": "ScrapeNinja response info for \"https://scrapeninja.net\"",
    "metadata": {
      "url": "https://scrapeninja.net",
      "runId": 2,
      "pageTitle": "ScrapeNinja Web Scraping API: Turns websites into data, on scale. 🚀",
      "statusCode": 200
    },
    "created_at": "2025-02-06T11:36:13.580Z"
  },
  {
    "level": "debug",
    "message": "Queued 14 new URLs for crawling",
    "metadata": {
      "runId": 2,
      "parentUrl": "https://scrapeninja.net",
      "linksQueued": 14,
      "currentDepth": 0
    },
    "created_at": "2025-02-06T11:36:13.620Z"
  },
  {
    "level": "info",
    "message": "Successfully processed page \"https://scrapeninja.net\"",
    "metadata": {
      "url": "https://scrapeninja.net",
      "depth": 0,
      "run_id": 2,
      "status": "completed",
      "max_pages": 5,
      "latency_ms": 1448,
      "parent_url": null,
      "links_found": 24,
      "queue_stats": {
        "total": 15,
        "failed": 0,
        "pending": 14,
        "completed": 1
      },
      "links_queued": 14,
      "processed_pages": 1
    },
    "created_at": "2025-02-06T11:36:13.638Z"
  },
  {
    "level": "debug",
    "message": "Selected URL \"https://scrapeninja.net/\" for processing",
    "metadata": {
      "url": "https://scrapeninja.net/",
      "depth": 1,
      "runId": 2,
      "queueId": 18
    },
    "created_at": "2025-02-06T11:36:13.655Z"
  },
  {
    "level": "debug",
    "message": "Fetching page \"https://scrapeninja.net/\" using ScrapeNinja",
    "metadata": {
      "url": "https://scrapeninja.net/",
      "runId": 2
    },
    "created_at": "2025-02-06T11:36:13.673Z"
  },
  {
    "level": "debug",
    "message": "Sending request to https://scrapeninja.p.rapidapi.com/scrape",
    "metadata": {
      "url": "https://scrapeninja.net/",
      "runId": 2,
      "engine": "scrape",
      "marketplace": "rapidapi"
    },
    "created_at": "2025-02-06T11:36:13.677Z"
  },
  {
    "level": "debug",
    "message": "ScrapeNinja response info for \"https://scrapeninja.net/\"",
    "metadata": {
      "url": "https://scrapeninja.net/",
      "runId": 2,
      "pageTitle": "ScrapeNinja Web Scraping API: Turns websites into data, on scale. 🚀",
      "statusCode": 200
    },
    "created_at": "2025-02-06T11:36:14.862Z"
  },
  {
    "level": "debug",
    "message": "Queued 0 new URLs for crawling",
    "metadata": {
      "runId": 2,
      "parentUrl": "https://scrapeninja.net/",
      "linksQueued": 0,
      "currentDepth": 1
    },
    "created_at": "2025-02-06T11:36:14.885Z"
  },
  {
    "level": "info",
    "message": "Successfully processed page \"https://scrapeninja.net/\"",
    "metadata": {
      "url": "https://scrapeninja.net/",
      "depth": 1,
      "run_id": 2,
      "status": "completed",
      "max_pages": 5,
      "latency_ms": 1108,
      "parent_url": "https://scrapeninja.net",
      "links_found": 24,
      "queue_stats": {
        "total": 15,
        "failed": 0,
        "pending": 13,
        "completed": 2
      },
      "links_queued": 0,
      "processed_pages": 2
    },
    "created_at": "2025-02-06T11:36:14.901Z"
  },
  {
    "level": "debug",
    "message": "Selected URL \"https://scrapeninja.net/docs/n8n/\" for processing",
    "metadata": {
      "url": "https://scrapeninja.net/docs/n8n/",
      "depth": 1,
      "runId": 2,
      "queueId": 19
    },
    "created_at": "2025-02-06T11:36:14.916Z"
  },
  {
    "level": "debug",
    "message": "Fetching page \"https://scrapeninja.net/docs/n8n/\" using ScrapeNinja",
    "metadata": {
      "url": "https://scrapeninja.net/docs/n8n/",
      "runId": 2
    },
    "created_at": "2025-02-06T11:36:14.933Z"
  },
  {
    "level": "debug",
    "message": "Sending request to https://scrapeninja.p.rapidapi.com/scrape",
    "metadata": {
      "url": "https://scrapeninja.net/docs/n8n/",
      "runId": 2,
      "engine": "scrape",
      "marketplace": "rapidapi"
    },
    "created_at": "2025-02-06T11:36:14.935Z"
  },
  {
    "level": "debug",
    "message": "ScrapeNinja response info for \"https://scrapeninja.net/docs/n8n/\"",
    "metadata": {
      "url": "https://scrapeninja.net/docs/n8n/",
      "runId": 2,
      "pageTitle": "Using ScrapeNinja with n8n | ScrapeNinja",
      "statusCode": 200
    },
    "created_at": "2025-02-06T11:36:18.783Z"
  },
  {
    "level": "debug",
    "message": "Queued 9 new URLs for crawling",
    "metadata": {
      "runId": 2,
      "parentUrl": "https://scrapeninja.net/docs/n8n/",
      "linksQueued": 9,
      "currentDepth": 1
    },
    "created_at": "2025-02-06T11:36:18.816Z"
  },
  {
    "level": "info",
    "message": "Successfully processed page \"https://scrapeninja.net/docs/n8n/\"",
    "metadata": {
      "url": "https://scrapeninja.net/docs/n8n/",
      "depth": 1,
      "run_id": 2,
      "status": "completed",
      "max_pages": 5,
      "latency_ms": 3806,
      "parent_url": "https://scrapeninja.net",
      "links_found": 23,
      "queue_stats": {
        "total": 24,
        "failed": 0,
        "pending": 21,
        "completed": 3
      },
      "links_queued": 9,
      "processed_pages": 3
    },
    "created_at": "2025-02-06T11:36:18.833Z"
  },
  {
    "level": "debug",
    "message": "Selected URL \"https://scrapeninja.net/cdn-cgi/l/email-protection\" for processing",
    "metadata": {
      "url": "https://scrapeninja.net/cdn-cgi/l/email-protection",
      "depth": 1,
      "runId": 2,
      "queueId": 31
    },
    "created_at": "2025-02-06T11:36:18.860Z"
  },
  {
    "level": "debug",
    "message": "Fetching page \"https://scrapeninja.net/cdn-cgi/l/email-protection\" using ScrapeNinja",
    "metadata": {
      "url": "https://scrapeninja.net/cdn-cgi/l/email-protection",
      "runId": 2
    },
    "created_at": "2025-02-06T11:36:18.877Z"
  },
  {
    "level": "debug",
    "message": "Sending request to https://scrapeninja.p.rapidapi.com/scrape",
    "metadata": {
      "url": "https://scrapeninja.net/cdn-cgi/l/email-protection",
      "runId": 2,
      "engine": "scrape",
      "marketplace": "rapidapi"
    },
    "created_at": "2025-02-06T11:36:18.885Z"
  },
  {
    "level": "debug",
    "message": "ScrapeNinja response info for \"https://scrapeninja.net/cdn-cgi/l/email-protection\"",
    "metadata": {
      "url": "https://scrapeninja.net/cdn-cgi/l/email-protection",
      "runId": 2,
      "pageTitle": "Email Protection | Cloudflare",
      "statusCode": 200
    },
    "created_at": "2025-02-06T11:36:20.005Z"
  },
  {
    "level": "debug",
    "message": "Queued 0 new URLs for crawling",
    "metadata": {
      "runId": 2,
      "parentUrl": "https://scrapeninja.net/cdn-cgi/l/email-protection",
      "linksQueued": 0,
      "currentDepth": 1
    },
    "created_at": "2025-02-06T11:36:20.030Z"
  },
  {
    "level": "info",
    "message": "Successfully processed page \"https://scrapeninja.net/cdn-cgi/l/email-protection\"",
    "metadata": {
      "url": "https://scrapeninja.net/cdn-cgi/l/email-protection",
      "depth": 1,
      "run_id": 2,
      "status": "completed",
      "max_pages": 5,
      "latency_ms": 1113,
      "parent_url": "https://scrapeninja.net",
      "links_found": 4,
      "queue_stats": {
        "total": 24,
        "failed": 0,
        "pending": 20,
        "completed": 4
      },
      "links_queued": 0,
      "processed_pages": 4
    },
    "created_at": "2025-02-06T11:36:20.046Z"
  },
  {
    "level": "debug",
    "message": "Selected URL \"https://scrapeninja.net/curl-to-php\" for processing",
    "metadata": {
      "url": "https://scrapeninja.net/curl-to-php",
      "depth": 1,
      "runId": 2,
      "queueId": 30
    },
    "created_at": "2025-02-06T11:36:20.062Z"
  },
  {
    "level": "debug",
    "message": "Fetching page \"https://scrapeninja.net/curl-to-php\" using ScrapeNinja",
    "metadata": {
      "url": "https://scrapeninja.net/curl-to-php",
      "runId": 2
    },
    "created_at": "2025-02-06T11:36:20.082Z"
  },
  {
    "level": "debug",
    "message": "Sending request to https://scrapeninja.p.rapidapi.com/scrape",
    "metadata": {
      "url": "https://scrapeninja.net/curl-to-php",
      "runId": 2,
      "engine": "scrape",
      "marketplace": "rapidapi"
    },
    "created_at": "2025-02-06T11:36:20.090Z"
  },
  {
    "level": "debug",
    "message": "ScrapeNinja response info for \"https://scrapeninja.net/curl-to-php\"",
    "metadata": {
      "url": "https://scrapeninja.net/curl-to-php",
      "runId": 2,
      "pageTitle": "Convert cURL to PHP Web Scraper",
      "statusCode": 200
    },
    "created_at": "2025-02-06T11:36:21.557Z"
  },
  {
    "level": "debug",
    "message": "Queued 0 new URLs for crawling",
    "metadata": {
      "runId": 2,
      "parentUrl": "https://scrapeninja.net/curl-to-php",
      "linksQueued": 0,
      "currentDepth": 1
    },
    "created_at": "2025-02-06T11:36:21.583Z"
  },
  {
    "level": "info",
    "message": "Successfully processed page \"https://scrapeninja.net/curl-to-php\"",
    "metadata": {
      "url": "https://scrapeninja.net/curl-to-php",
      "depth": 1,
      "run_id": 2,
      "status": "completed",
      "max_pages": 5,
      "latency_ms": 1451,
      "parent_url": "https://scrapeninja.net",
      "links_found": 6,
      "queue_stats": {
        "total": 24,
        "failed": 0,
        "pending": 19,
        "completed": 5
      },
      "links_queued": 0,
      "processed_pages": 5
    },
    "created_at": "2025-02-06T11:36:21.599Z"
  },
  {
    "level": "info",
    "message": "Reached maximum pages (5), stopping crawler",
    "metadata": {
      "maxPages": 5,
      "processedPages": 5
    },
    "created_at": "2025-02-06T11:36:21.605Z"
  },
  {
    "level": "info",
    "message": "Crawler process completed for run \"2\"",
    "metadata": {
      "maxPages": 5,
      "processedPages": 5
    },
    "created_at": "2025-02-06T11:36:21.647Z"
  }
]

API-Powered vs. Local Extraction

A key design choice was splitting remote and local responsibilities:

• Remote Rendering with ScrapeNinja API: For speed, or when a full browser environment is required, the /scrape or /scrape-js endpoints handle the page fetch (a minimal request sketch follows this list).

• Local “Extract Primary Content” Operation: The actual text extraction runs locally inside n8n, which offers control over processing and reduces load on remote systems.
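For reference, a direct call to the remote endpoint (the same https://scrapeninja.p.rapidapi.com/scrape URL that shows up in the logs above) looks roughly like this. The header names follow the usual RapidAPI conventions, and the response field I read here is an assumption, so treat it as a sketch rather than API documentation:

// Sketch: fetch a page remotely via the ScrapeNinja /scrape endpoint on RapidAPI.
async function scrapeRemote(url: string): Promise<string> {
  const res = await fetch("https://scrapeninja.p.rapidapi.com/scrape", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-RapidAPI-Key": process.env.RAPIDAPI_KEY ?? "",
      "X-RapidAPI-Host": "scrapeninja.p.rapidapi.com",
    },
    body: JSON.stringify({ url }),
  });
  const data = await res.json();
  return data.body; // assumed: raw HTML of the page; local extraction runs on it afterwards
}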

I want to note that you currently need a ScrapeNinja API key (ScrapeNinja is a cloud SaaS) to launch the n8n web crawler, but you do not need an API key for the other, locally running operations. ScrapeNinja has a free plan.

For instance, "Clean up HTML content" and "Extract ..." operations do not require ScrapeNinja subscription, while"Scrape ..." and "Crawl" operations require ScrapeNinja API key.

This setup balances efficiency and flexibility, letting you tailor workflows to your specific requirements.


Getting Started with the n8n Web Crawler: A Step-by-Step Guide


To set up the ScrapeNinja crawler in n8n:

1. Open n8n Dashboard: Go to Settings → Community Nodes.

2. Install the Node: Find and install n8n-nodes-scrapeninja, then restart n8n if prompted.

3. Check the Version: Use 0.4.0 or later for the ScrapeNinja n8n node. You’ll also need your ScrapeNinja API key.

4. Configure Your Workflow: Insert the crawler node, define the seed URL, and set parameters like the maximum page count.

5. Run and Monitor: Execute the workflow, then review logs via the JSON output or the Postgres crawler_logs table in Supabase (a query sketch follows this list).
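If you prefer SQL over scrolling through JSON, the same logs can be pulled straight from Postgres. A small sketch with the pg driver, assuming the column names visible in the JSON output above (level, message, metadata, created_at) and that metadata is stored as JSONB:

// Sketch: tail the most recent crawler_logs rows for a given run.
import { Client } from "pg";

async function tailCrawlerLogs(runId: number): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL }); // Supabase connection string
  await client.connect();
  const { rows } = await client.query(
    `SELECT level, message, metadata, created_at
       FROM crawler_logs
      WHERE metadata->>'runId' = $1
      ORDER BY created_at DESC
      LIMIT 50`,
    [String(runId)],
  );
  for (const row of rows) {
    console.log(row.created_at, row.level, row.message);
  }
  await client.end();
}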


I had it running smoothly on my self-hosted instance in a short time.


Building Powerful Knowledge Bases


One standout application is compiling large documentation sets into a unified Markdown file:

• Your Own Docs: I tested it on ScrapeNinja documentation to confirm its capabilities.

• Aggregated Content: Core text from each page is combined into a single file, ideal for LLM ingestion (see the stitching sketch after this list).

• Enhanced LLM Responses: A dedicated knowledge base helps an LLM produce precise, contextual answers that generic training data cannot match.
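If you want to do the stitching yourself, for example in a small script after the crawl, it amounts to little more than concatenation. A sketch, assuming each crawled page carries url, title, and markdown fields (illustrative names, not the node's exact schema):

// Sketch: merge per-page Markdown into one knowledge-base file.
import { writeFile } from "node:fs/promises";

interface CrawledPage {
  url: string;
  title: string;
  markdown: string;
}

async function buildKnowledgeBase(pages: CrawledPage[], outFile: string): Promise<void> {
  const combined = pages
    .map((p) => `# ${p.title}\n\nSource: ${p.url}\n\n${p.markdown}`)
    .join("\n\n---\n\n");
  await writeFile(outFile, combined, "utf8");
}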


Monitoring, Debugging, and Maintaining Transparency


Web crawling can take minutes or longer, so good monitoring is essential:

• JSON Output Logs: Every step is logged in real time, making it easy to integrate with external tools or review later.

• Postgres Crawler Logs: Logs are stored in a Postgres database (via Supabase) for persistent records of all operations.

• Resource Monitoring: Memory usage and performance are checked continuously to handle multiple pages in parallel without excessive load.


This has been crucial for debugging and performance reviews. In one test, memory stayed under 119 MB even with ten concurrent browser sessions.
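If you want to reproduce that kind of measurement on your own instance, the standard Node.js counters are enough (this is just how I would spot-check it, not what the node does internally):

// Spot-check memory from any Node.js process (values converted to MB).
const { rss, heapUsed } = process.memoryUsage();
console.log(`rss: ${(rss / 1024 / 1024).toFixed(1)} MB, heap used: ${(heapUsed / 1024 / 1024).toFixed(1)} MB`);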


Real-World Applications and Future Directions


Though initially focused on LLM knowledge bases, the crawler can be adapted for:

• E-commerce: Aggregate product data, reviews, and pricing from large catalogs.

• Content Aggregation: Collect articles or forum posts for comprehensive data sets.

• Market Analysis: Scrape competitor sites for trends and product insights.

• Academic Research: Pull data from journals, conferences, or public repositories.


Future improvements may include better error handling, more advanced parallel processing, and closer integration with vector databases for RAG. Community feedback is welcome to shape the project’s roadmap.

The ScrapeNinja Recursive Web Crawler for n8n is a result of practical insights from working on real-world scraping tasks. By combining the ScrapeNinja API (via /scrape or /scrape-js) with local extraction, it creates an efficient path to build domain-specific knowledge bases and improve LLM performance.


Try it on your self-hosted n8n setup. Install the node, configure it with your API key, and convert any web pages into structured, actionable data. Your feedback is welcome as I refine this project.