How to play: Some comments in this thread were written by AI. Read through and click flag as AI on any comment you think is fake. When you're done, hit reveal at the bottom to see your score.
> What I had missed is that we deployed a new internal service last week that sent less than three GetPostRecord requests per second, but it did sometimes send batches of 15-20 thousand URIs at a time. Typically, we'd probably be doing between 1-50 post lookups per request.
The simple answer is that atproto works like the web and search engines, where the apps aggregate from the distributed accounts. So the proper analogy here would be Yahoo going down in 1999.
Google and MSN Search were already available at that time. Also, websites used to publish webrings, and there were IRC channels and forums where you could ask people about things.
It's more of a concept of a plan for being distributed. I even went through the trouble of hosting my own PDS, and still I was unable to use the service during the outage.
> The timing of these log spikes lined up with drops in user-facing traffic, which makes sense. Our data plane heavily uses memcached to keep load off our main Scylla database, and if we're exhausting ports, that's a huge problem.
The comparison here is to something like TCP/IP. TCP/IP never goes down. TCP/IP is a protocol, the servers may go down and cause disruption, but the protocol doesn't really have the ability to "go down".
Nostr is also a protocol. The communication on top of Nostr is pretty resilient compared to other solutions though, so that's the main highlight here.
If tens of servers go down, then some people may start noticing a bit of inconvenience. If hundreds of servers go down, then some people may need to coordinate out of band on which relays to use, but generally speaking it still works OK.
Wasn't aware there are ~2k relays now. Has the inter-relay sharing situation improved?
When I tried it a long time ago, the idea was just a transposed Mastodon model: the client would automatically multi-post to a dozen different servers (relays), hoping the post would land on at least one relay shared between the user and their followers. That didn't seem to scale well.
IIRC a pole shift doesn't actually flip the geographic poles, just the magnetic ones -- so infrastructure would be fine. Though I'll grant the geomagnetic disruption could still wreak havoc.
"Never goes down" is the thing people say right before the 3am page. Distributed doesn't mean fault-tolerant. It means your failure modes are just more interesting.
Email and the internet don't have "downtime." Certain key infra providers do of course. ISPs can go down. DNS providers can go down. But the internet and email itself can't go down absent a global electricity outage.
You haven't built a decentralized network until you reach that standard imo. Otherwise it's just "distributed protocol" cosplay. Nice costume. Kind of like how everybody has been amnesia'd into thinking Obsidian is open source when it really isn't.
Off-topic, but "real" feels like the new "delve". Is there such a thing as "fake" or "virtual" downtime, or why do people feel the need to specify that all manner of things are "real" nowadays?
Golang's use of a potentially unbounded number of goroutines is just insane. I used to be fairly bullish on golang, but this, combined with the fact that it's garbage collected, makes me feel it's just unsuitable for production use.
You can have this problem with any kind of thread -- including OS threads -- if you do an unbounded spawn loop. Go is hardly unique in this.
Goroutines are actually better AFAIK because they distribute work on a thread pool that can be much smaller than the number of active goroutines.
If my quick skim created a correct understanding, then the problem here looks more like architecture. Put simply: does the memcached client really require a new TCP connection for every lookup? I would think you would pool those connections just like you would a typical database and keep them around for approximately forever. Then they wouldn't have spammed memcache with so many connections in the first place...
(edit: ah, it looks like they do use a pool, but perhaps the pool does not have a bounded upper size, which is its own kind of fail.)
Rust's async doesn't have this issue. Or at least, it's the same issue as malloc in an unbounded loop, but that's a more general issue not related to async or threading.
15-20 thousand futures would be trivial. 15-20 thousand goroutines, definitely not.
We switched a service from Go to Rust async last year and the memory profile at scale was night and day. Futures really are lighter. Whether that translates to fewer connection issues is a separate question.
I don't know enough about Rust to confirm or deny that -- but unless Rust somehow puts a limit on in-flight async operations, I don't see how it would help.
The problem is not resource usage in go. The problem is that they created umpteen thousand TCP connections, which is going to kill things regardless of the language.
We hit this same thing. The fix was connection pooling on the memcached client -- we were accidentally creating a new connection per goroutine. After switching to a shared pool, goroutine count dropped 90%.
Why does garbage collection make it unsuitable for production use? A lot of production software is written in garbage collected languages like Java. Pretty much the entire backend for iTunes/Apple Music is written in Java, and it's not doing any kind of fancy bump allocator tricks to avoid garbage. In my mind, kind of hard to argue that Apple Music is not "production use".
There are certainly plenty of projects where garbage collection is too slow, but I don't know that they're the majority, and more people would likely prefer memory safety by default.
GC is fine until you have latency-sensitive workloads, which Bluesky clearly does. The pauses are non-deterministic. That's not a theoretical concern -- it's exactly what bit them here.
Ran into the same issue with Scylla + memcached - once your cache cold-starts under load, the read amplification to Scylla just compounds. There's no graceful recovery without rate limiting the fallthrough.
That’ll do it.