Crawling a billion web pages in just over 24 hours, in 2025 (andrewkchan.dev)
194 points by pseudolus 40 days ago | 61 comments



bndr 40 days ago

I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar just to get access to any website. Bandwidth and storage are the smallest cost factors.

Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% success rate crawling a website, using a mix of residential proxies, captcha solvers, rotating user-agents, and stealth Chrome binaries; otherwise I would get a 403 immediately with no HTML being served.
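The rotation part is the easiest piece to show. A minimal stdlib sketch of the idea, assuming nothing about the parent's actual stack: the UA strings, retry count, and `fetch` helper here are all illustrative, and real setups layer proxies and browser automation on top.

```python
import random
import urllib.error
import urllib.request

# Illustrative pool; a real crawler rotates many current browser UA strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
]

def fetch(url: str, retries: int = 3):
    """Try a URL with a different random User-Agent on each attempt,
    retrying only when the server answers with a 403 block."""
    for _ in range(retries):
        req = urllib.request.Request(
            url, headers={"User-Agent": random.choice(USER_AGENTS)}
        )
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 403:  # only a block warrants another identity
                raise
    return None  # still blocked after all retries
```

Proxy rotation is the same shape: pick a different exit per attempt instead of (or alongside) a different header.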


Just stop scraping. I'll do everything to block you.

> spinning disks have been replaced by NVMe solid state drives with near-RAM I/O bandwidth

Am I missing something here? Even Optane is an order of magnitude slower than RAM.

Yes, under ideal conditions SSDs can have very fast linear reads, but IOPS and latency have barely improved in recent years, and that's what really makes a difference.

Of course, compared to spinning disks, they are much faster, but the comparison to RAM seems wrong.

In fact, for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU, so VRAM needs to be used. That's how latency-sensitive some applications have become.


>for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU

That's not why. It's because RAM has a narrower bus than VRAM. If it were a matter of distance, it'd just have greater latency, but that would still give you tons of bandwidth to play with.

leo89 39 days ago

That's true, but the latency still matters for inference once you've loaded the model weights. When you're doing token-by-token generation, you're bottlenecked by memory bandwidth and latency between the GPU and wherever the KV cache lives.
carbon29 40 days ago

I've been profiling NVMe workloads for search indexing and the random read IOPS gap between PCIe 4.0 NVMe and DDR4 is still roughly 100x. Sequential throughput tells a different story but that's not the bottleneck. The "near-RAM" claim only holds if your access patterns are extraordinarily linear.

I can't edit my comment, but to the people responding here, thank you for adding all this information. It really helped elucidate why the VRAM vs RAM distinction matters, and it keeps my somewhat naive interpretation from being the only thing people see. Thanks!
rchen 39 days ago

Your distinction about IOPS/latency is spot on. The sustained throughput numbers you see in NVMe marketing often come from sequential workloads that don't reflect random access patterns. Research on memory hierarchies consistently shows random access latency gaps remain stubbornly wide, even with newer storage tech.
finnlab 40 days ago

Nice work, but I feel like AWS isn't required for this. There are small hosting companies with specialized servers (50gbit shared medium for under $10); you could probably do this for under $100 with some optimization.
nurettin 40 days ago

I did some crawling on Hetzner back in the day. They monitor traffic and make sure you don't automate retrieval of publicly available data. They send you an email telling you that they are concerned because you got the IP blacklisted. Funny thing is: they own the blacklist that they refer to.
varispeed 40 days ago

This. AWS is like a cash furnace, only really usable for VC backed efforts with more money than sense.
drift 40 days ago

Those $10 servers have terrible hardware RAID, shared NICs that choke at burst traffic, and zero SLA. First time you saturate 50gbit you'll get null-routed for abuse. Ask me how I know.
sunpolice 40 days ago

I was able to get 35k req/sec on a single node with Rust (custom HTTP stack, custom HTML parser, custom queue, custom KV database) with obsessive optimization. It's possible to scrape a Bing-sized index (say 100B docs) each month with only 10 nodes, for under $15k.

Thought about making it public but probably no one would use it.
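The arithmetic checks out with a lot of headroom, assuming the 35k req/sec figure is sustained around the clock:

```python
req_per_sec_per_node = 35_000
nodes = 10
seconds_per_month = 86_400 * 30

docs_per_month = req_per_sec_per_node * nodes * seconds_per_month
print(f"{docs_per_month:,}")  # 907,200,000,000 — roughly 9x a 100B-doc index
```

So even at half that sustained rate (blocked domains, retries, politeness delays), 10 nodes would still cover 100B docs in a month.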


please do

> because redis began to hit 120 ops/sec and I’d read that any more would cause issues

Suspicious. I don’t think I’ve ever read anything that says redis taps out below tens of thousands of ops…


Well, the most important part seems to be glossed over, and that's the IP addresses. Many websites simply block (or want to block) anything that's not Google and not a "real user".
ph4rsikal 40 days ago

When I read this, I realize how small Google makes the Internet.

There was a time when being able to do this meant you were on the path to becoming a (m)(b)illionaire. Still is, I think.

As an experiment, it's interesting.

If anyone actually needs such a dataset, look into CommonCrawl first. I feel using something that already exists will be more cooperative and considerate than everyone overloading every website with their spider. https://commoncrawl.org/overview

mudkipdev 38 days ago

Does AWS actually allow you to crawl like this? I've been interested in a similar project but the cloud providers I typically use seem to ban it in their terms of service
corv 39 days ago

Python is obviously too slow for web-scale
gethly 39 days ago

> I also truncated page content to 250KB before passing it to the parser.

WTF did I just read?

tengada1 39 days ago

It's just HTML, presumably not requesting JS libraries, so 250KB is a large amount.
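For context, a capped streaming read is one common way to implement that truncation; this is a minimal sketch under that assumption, not the article's actual code (`read_truncated` and the 64KB chunk size are illustrative):

```python
import io

MAX_BYTES = 250 * 1024  # the article's 250KB cap

def read_truncated(stream, limit: int = MAX_BYTES) -> bytes:
    """Read at most `limit` bytes so a single huge page can't blow up
    parser memory or CPU time."""
    chunks, remaining = [], limit
    while remaining > 0:
        chunk = stream.read(min(64 * 1024, remaining))
        if not chunk:  # body ended before the cap
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Simulate a ~1MB response body; only the first 250KB survives.
page = io.BytesIO(b"<html>" + b"x" * 1_000_000)
body = read_truncated(page)
assert len(body) == MAX_BYTES
```

The trade-off is that anything past the cap (late links, footer nav) is invisible to the parser, which is usually acceptable for crawl frontiers.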
csilva 40 days ago

We did something similar at Yahoo in 2008 with commodity hardware and a custom C++ crawler. The real trick was always handling the politeness rules without killing throughput—sounds like they just threw money at bandwidth instead of solving that properly.
overfeed 39 days ago

> this scale per-domain politeness queuing also becomes a genuine headache

Not really a headache - if you've ever implemented resource-based, server-side rate limiting (per-endpoint, with client-ID and/or IP buckets), that's all the logic that's required, adapted for the client side. One could wrap rate-limiting libraries designed for server-side usage and call it a day.
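The adaptation described above can be sketched as a per-domain token bucket, keyed on where requests are going instead of where they came from. This is a minimal illustration of that idea, not any particular library; the class name and parameters are made up here:

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Client-side token bucket per domain: the same logic as server-side
    per-client rate limiting, just keyed on the crawl target."""

    def __init__(self, rate: float = 1.0, burst: int = 1):
        self.rate = rate    # tokens refilled per second
        self.burst = burst  # bucket capacity
        self.tokens = defaultdict(lambda: float(burst))
        self.last = {}      # last refill timestamp per domain

    def acquire(self, domain: str) -> float:
        """Return seconds to wait before it's polite to hit `domain`
        (0.0 means go ahead now)."""
        now = time.monotonic()
        elapsed = now - self.last.get(domain, now)
        self.last[domain] = now
        self.tokens[domain] = min(
            float(self.burst), self.tokens[domain] + elapsed * self.rate
        )
        if self.tokens[domain] >= 1.0:
            self.tokens[domain] -= 1.0
            return 0.0
        return (1.0 - self.tokens[domain]) / self.rate

limiter = DomainRateLimiter(rate=2.0)          # ~2 req/sec per domain
assert limiter.acquire("example.com") == 0.0   # first hit goes straight through
assert limiter.acquire("example.com") > 0.0    # second needs a short wait
assert limiter.acquire("other.org") == 0.0     # independent bucket per domain
```

The crawler sleeps for the returned delay (or re-queues the URL), which keeps per-domain politeness without stalling the global fetch loop.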

I hate how people who are bad at parallelizing their user-agents across the internet are causing needless pain and giving scrapers a bad name. They are also causing blowback on the more well-behaved scrapers.