Crawling 100 million URLs without melting PostgreSQL
Architecture choices behind CrawlZilla — a Redis URL frontier, a partitioned PostgreSQL results store, PgBouncer in front, and a frontend that virtualises millions of rows.
2026-01-20 · 9 min read
CrawlZilla is an enterprise SEO crawler that regularly audits sites in the hundreds of millions of URLs. That number trips people up, so let me be concrete about what "100M URLs per crawl" actually means: every one of those URLs is fetched, parsed, stored with ~40 SEO data points, and made available for virtualised grid display, filtering, sorting, and change-tracking on subsequent crawls.
The interesting constraints aren't at the crawler — they're at the storage and the UI.
Service map
- crawler-frontend — React 18 + Vite + Tailwind. All crawl UI.
- crawler-urlfrontier — Redis-backed URL queue. Dedupe, priority, politeness.
- crawler-crawlerworker — distributed fetch workers.
- crawler-jsrenderer — headless Chromium pool for JS rendering.
- crawler-dnscache — local resolver so hot-path lookups don't hammer upstream DNS.
- crawler-resultsapi — REST API in front of the results store, with pagination and filtering.
- crawler-pgbouncer — transaction pooling in front of PostgreSQL.
- crawler-checkpointservice — crawl state checkpointing so workers can be restarted without losing progress.
The frontier
URL dedupe and prioritisation live in Redis, not PostgreSQL. Every write to the queue is O(1); every dedupe check is O(1). The frontier is the only part of the system that cares about the total URL count in real time, and Redis is exactly the right tool for that shape of work.
Politeness rules (per-host concurrency, crawl delay, robots.txt) are enforced at the frontier too — workers ask for the next URL and the frontier hands them something they're allowed to fetch right now.
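A minimal sketch of that shape of work, assuming an ioredis client: a set for O(1) dedupe, a sorted set for priority, and a per-host next-allowed-fetch key for politeness. The key names, batch size, and scoring here are illustrative rather than CrawlZilla's actual layout, and a real frontier would claim URLs atomically (a Lua script or ZPOPMIN) instead of the peek-then-remove shown below.

```ts
import Redis from "ioredis";

const redis = new Redis();

// O(1) dedupe: SADD returns 1 only the first time a URL is seen.
// Priority: a sorted set where a lower score means "fetch sooner".
async function enqueue(url: string, priority: number): Promise<boolean> {
  const firstSeen = await redis.sadd("frontier:seen", url);
  if (firstSeen === 0) return false; // already known, drop it
  await redis.zadd("frontier:queue", priority, url);
  return true;
}

// Politeness: each host carries a "next allowed fetch" timestamp.
// A worker only receives a URL whose host is allowed right now.
async function nextUrl(crawlDelayMs: number): Promise<string | null> {
  const candidates = await redis.zrange("frontier:queue", 0, 19); // peek a small batch
  const now = Date.now();
  for (const url of candidates) {
    const host = new URL(url).host;
    const notBefore = Number(await redis.get(`frontier:host:${host}`)) || 0;
    if (now < notBefore) continue; // host is still inside its crawl-delay window
    await redis.zrem("frontier:queue", url);
    await redis.set(`frontier:host:${host}`, now + crawlDelayMs);
    return url;
  }
  return null; // nothing politely fetchable right now
}
```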
The results store
PostgreSQL holds crawl results. The table schema is intentionally wide — one row per URL, with ~40 columns covering indexability, canonical, response, content, structured data, headers, link counts, etc.
Two decisions make this viable at 100M+ rows:
- Partition by crawl. Each crawl session gets its own partition. Bulk inserts during a crawl don't fight with queries against old crawls. When a crawl is deleted, the partition is dropped — no slow DELETE at scale.
- PgBouncer in transaction mode. The results API is chatty; direct connections would blow the server's connection limit instantly. PgBouncer keeps the pool at a sane size and the API server is effectively stateless.
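A sketch of what partition-per-crawl looks like in practice, using node-postgres against a list-partitioned parent table. The table name, columns, and the interpolation of the session id into the DDL are illustrative only, not CrawlZilla's actual schema.

```ts
import { Pool } from "pg";

const pool = new Pool(); // connection string points at PgBouncer, not Postgres directly

// One-time setup: a parent table partitioned by crawl session.
async function ensureParentTable() {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS crawl_results (
      crawl_session_id bigint  NOT NULL,
      url              text    NOT NULL,
      status_code      int,
      indexable        boolean,
      canonical_url    text
      -- ...the rest of the ~40 SEO columns...
    ) PARTITION BY LIST (crawl_session_id)
  `);
}

// Starting a crawl creates its own partition; bulk inserts land only there.
async function createCrawlPartition(sessionId: number) {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS crawl_results_${sessionId}
      PARTITION OF crawl_results FOR VALUES IN (${sessionId})
  `);
}

// Deleting a crawl is a metadata-only DROP, not a 100M-row DELETE.
async function dropCrawl(sessionId: number) {
  await pool.query(`DROP TABLE IF EXISTS crawl_results_${sessionId}`);
}
```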
JS rendering as a pool, not a pipeline
Rendering JavaScript is expensive. You don't want to render every page — only the ones that actually need it. The crawler fetches plain HTML first and decides whether a render is worth the CPU budget based on signals: suspiciously empty content, known SPA frameworks, client routing markers.
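To make that decision concrete, here is a hedged sketch of what such a heuristic can look like; the threshold and the framework markers are illustrative guesses, not CrawlZilla's actual rules.

```ts
// Decide whether a fetched page deserves a Chromium render.
function needsJsRender(html: string): boolean {
  // Signal 1: suspiciously empty visible content after stripping scripts and tags.
  const bodyText = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, "")
    .trim();
  if (bodyText.length < 200) return true;

  // Signal 2: known SPA framework or client-routing markers in the raw HTML.
  const spaMarkers = [/id="root"\s*>\s*</, /id="app"\s*>\s*</, /__NEXT_DATA__/, /ng-version=/];
  return spaMarkers.some((m) => m.test(html));
}
```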
When rendering is required, the worker sends the URL to the JS renderer pool and carries on with other URLs. The renderer is its own service with its own horizontal scale — nothing blocks on it.
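The handoff itself can be as simple as a queue push. The queue name and payload here are assumptions made for the sketch; the point is that the fetch worker never waits for Chromium.

```ts
import Redis from "ioredis";

const redis = new Redis();

// Fetch worker side: queue the render request and move on to the next URL.
async function requestRender(url: string, sessionId: number): Promise<void> {
  await redis.lpush("render:pending", JSON.stringify({ url, sessionId }));
}

// Renderer pool side: each Chromium worker blocks on the queue at its own pace.
async function nextRenderJob(): Promise<{ url: string; sessionId: number } | null> {
  const popped = await redis.brpop("render:pending", 5); // 5 s timeout
  return popped ? JSON.parse(popped[1]) : null;
}
```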
The frontend is the surprise
Most of the hard decisions in CrawlZilla are actually in the React frontend, not the crawler. Here's why.
Users expect to browse millions of URLs in a Screaming-Frog-style grid — sort, filter, open detail panels — without the browser dying. That requires:
- Virtual scrolling. 25 px row height, 60-row overscan. The DOM only ever holds what's visible; the windowing math is sketched after this list.
- Paged infinite scroll from the API. 2000 rows per page. The API is built around that page size.
- Session-gen concurrency guards. When the user switches crawl session, a ref is incremented. Every async fetch captures the current gen and discards its result if the gen has changed. Without this, an in-flight request for the old session can overwrite state for the new one. A synchronous null of the session ID ref before any await closes the last race (the pattern is sketched after this list).
- Polling that adapts. 2s while a crawl is running, 5s while paused or complete, 10s for the global active-sessions list.
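First, the windowing math behind the virtual-scrolling and paged-loading bullets, using the figures from the post (25 px rows, 60-row overscan, 2000-row API pages). The function names are illustrative.

```ts
const ROW_HEIGHT = 25;
const OVERSCAN = 60;
const PAGE_SIZE = 2000;

// Which rows should actually exist in the DOM for the current scroll position?
function visibleRange(scrollTop: number, viewportHeight: number, totalRows: number) {
  const first = Math.max(0, Math.floor(scrollTop / ROW_HEIGHT) - OVERSCAN);
  const last = Math.min(
    totalRows - 1,
    Math.ceil((scrollTop + viewportHeight) / ROW_HEIGHT) + OVERSCAN
  );
  return { first, last };
}

// Which API pages does that window touch? Anything not yet cached gets fetched.
function pagesForRange(first: number, last: number): number[] {
  const firstPage = Math.floor(first / PAGE_SIZE);
  const lastPage = Math.floor(last / PAGE_SIZE);
  return Array.from({ length: lastPage - firstPage + 1 }, (_, i) => firstPage + i);
}
```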
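Second, the generation guard, as a React hook sketch. The hook name and API endpoint are assumptions; the capture-then-compare pattern is the part that matters.

```ts
import { useRef, useCallback } from "react";

function useSessionRows() {
  const genRef = useRef(0);
  const sessionIdRef = useRef<string | null>(null);

  const switchSession = useCallback((sessionId: string | null) => {
    genRef.current += 1;              // invalidate every in-flight fetch
    sessionIdRef.current = sessionId; // synchronous write, before any await
  }, []);

  const loadPage = useCallback(async (page: number, apply: (rows: unknown[]) => void) => {
    const gen = genRef.current;        // capture the generation at call time
    const sessionId = sessionIdRef.current;
    if (sessionId === null) return;    // nothing selected; bail before fetching

    const res = await fetch(`/api/sessions/${sessionId}/rows?page=${page}`); // illustrative endpoint
    const rows = await res.json();

    if (gen !== genRef.current) return; // session changed mid-flight: discard
    apply(rows);
  }, []);

  return { switchSession, loadPage };
}
```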
The crawl-compare feature is built on the same primitives: load the same pages for two sessions, diff them in memory, render the diff with the same virtualised grid. No extra infrastructure — just carefully structured client code.
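As a rough illustration of that in-memory diff, keyed by URL and assuming a simplified row shape (the real rows carry ~40 fields):

```ts
interface CrawlRow {
  url: string;
  statusCode: number;
  indexable: boolean;
}

type Change = { url: string; kind: "added" | "removed" | "changed" };

// Compare the same page of rows loaded for two crawl sessions.
function diffPages(oldRows: CrawlRow[], newRows: CrawlRow[]): Change[] {
  const oldByUrl = new Map(oldRows.map((r) => [r.url, r] as const));
  const changes: Change[] = [];

  for (const row of newRows) {
    const prev = oldByUrl.get(row.url);
    if (!prev) changes.push({ url: row.url, kind: "added" });
    else if (prev.statusCode !== row.statusCode || prev.indexable !== row.indexable)
      changes.push({ url: row.url, kind: "changed" });
    oldByUrl.delete(row.url);
  }
  for (const url of oldByUrl.keys()) changes.push({ url, kind: "removed" });
  return changes;
}
```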
What I'd tell anyone building this shape of system
- Put the fast-changing state in Redis. The URL frontier does not belong in a relational store.
- Partition early. Retrofitting partitioning onto a billion-row table is grim. Do it on day one.
- Pool your database connections. PgBouncer in transaction mode is not optional at this scale.
- Concurrency in the UI is a real engineering problem. Virtualisation, generation-guarded fetches, and adaptive polling are the boring, load-bearing parts.