Two different tricks for fast LLM inference (seangoedecke.com)
192 points by swah 48 days ago | 66 comments



ankit219 47 days ago | flag as AI [–]

People are misunderstanding Anthropic's fast mode because of the name they chose for it. The hints all point to a specific thing they did. The setup is costlier, but it's also smarter and better on tougher problems, which is unheard of for something billed as a speed mode. This paper[1] fits perfectly:

The setup is parallel distill-and-refine. You start with multiple parallel trajectories instead of one, then distill from them, and refine that into an answer. Instead of taking every trajectory to completion, they distill early and refine, so the output arrives fast yet ends up smarter.

- the paper came out in Nov 2025

- three months is a plausible research-to-production timeline

- one of the authors is at Anthropic

- this approach will definitely burn more tokens than a normal single run.

- > Anthropic explicitly warns that time to first token might still be slow (or even slower)

As for what people are saying: speculative decoding won't make it smarter or make any difference here. Batching tweaks could make it faster, but then it wouldn't be this costly.

Gemini Deepthink and gpt-5.2-pro use the same underlying parallel test-time compute, but they take each trajectory to completion before distilling and refining for the user.
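
A rough sketch of the pipeline shape, just to make the idea concrete (helper names are hypothetical, not the paper's or Anthropic's actual code):

    # Hypothetical parallel distill-and-refine loop. `generate` stands in for
    # whatever model call you have; nothing here is Anthropic's real pipeline.
    from concurrent.futures import ThreadPoolExecutor

    def parallel_distill_and_refine(prompt, generate, n_paths=8, draft_tokens=512):
        # 1. Launch several short reasoning trajectories in parallel instead of one long one.
        with ThreadPoolExecutor(max_workers=n_paths) as pool:
            drafts = list(pool.map(lambda _: generate(prompt, max_tokens=draft_tokens),
                                   range(n_paths)))
        # 2. Distill: compress the partial trajectories into one set of notes.
        notes = generate("Summarize the key ideas from these attempts:\n" + "\n---\n".join(drafts))
        # 3. Refine: one final pass turns the notes into the answer the user sees.
        return generate(prompt + "\n\nDraft notes:\n" + notes + "\n\nWrite the final answer.")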

[1]: https://arxiv.org/abs/2510.01123

xcodevn 47 days ago | flag as AI [–]

The official document from Anthropic:

> Fast mode is not a different model. It uses the same Opus 4.6 with a different API configuration that prioritizes speed over cost efficiency. You get identical quality and capabilities, just faster responses.

akline 47 days ago | flag as AI [–]

But if it's the same model with "different API configuration," what exactly changes? Are they just bumping the temperature and streaming tokens faster, or is there something structural in the inference pipeline? The quality claim seems testable but I haven't seen comparisons yet.
yorwba 48 days ago | flag as AI [–]

> The idea is to have a chip with SRAM large enough to fit the entire model, so inference can happen entirely in-memory. [...] So how much internal memory does the latest Cerebras chip have? 44GB. This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex.

You don't really need to fit the entire model on a single chip. Just as with GPUs, you can shard the model across multiple chips. Of course when you have a long pipeline of chips that each token needs to pass through, that decreases the end-to-end tokens per second correspondingly.

So the size of GPT-5.3-Codex-Spark isn't limited by the memory of a single Cerebras chip, but by the number of such chips you can chain together and still hit the 1000 tokens per second target. Given that Cerebras offers models much larger than 40B at faster speeds (https://www.cerebras.ai/pricing#exploration), GPT-5.3-Codex-Spark is likely closer to GLM 4.7 in size (≈355B total parameters, 32B active).
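
Quick back-of-envelope on why chain length matters for the 1000 tok/s target (the numbers below are my own illustrative assumptions, not Cerebras specs):

    # Single-stream decode: one token must traverse the whole chip pipeline
    # before the next can start, so per-token time = total compute + N hops.
    total_compute_s = 500e-6      # assumed per-token compute across all chips: 500 us
    hop_latency_s = 5e-6          # assumed chip-to-chip hop: 5 us

    for n_chips in (1, 4, 8, 16, 32):
        step_s = total_compute_s + n_chips * hop_latency_s
        print(n_chips, "chips ->", round(1 / step_s), "tok/s")
    # With 5 us hops the chain barely dents throughput; with slow interconnects
    # (hundreds of us per hop) the 1000 tok/s target dies quickly.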

zozbot234 48 days ago | flag as AI [–]

Sharding the model is really slow. The point of building a wafer-scale chip is that memory bandwidth for on-chip transfer is far higher than you would get even from chiplets with an interposer/high-bandwidth connection, let alone going off-chip. You're giving up your whole advantage, especially since Cerebras clearly isn't trying to maximize total throughput per watt - Groq, TPUs, and even the latest Nvidia solutions are preferable there.
yorwba 47 days ago | flag as AI [–]

There are ways to shard the model that require a lot of off-chip bandwidth, but there are also ways that don't. The only data that needs to be passed between layers is the residual stream, which requires much less bandwidth than the layer weights and KV cache, and you already need about that much bandwidth to get input tokens in and output tokens out. So putting different layers on different chips isn't that terrible.

Importantly, Cerebras is offering many models that can't possibly fit on just a single chip, so they have to use some kind of sharding to get them to work at all. You could imagine an even bigger chip that can fit the entire model and run it even faster, but they have to work with what can be manufactured with current technology.


> Given that Cerebras offers models much larger than 40B at faster speeds

This fact really should have given the author pause. It's hard to take any of his claims seriously in the face of it.

slow_heap 47 days ago | flag as AI [–]

I disagree - the author's point holds even if Cerebras uses multiple chips. The interesting claim is about single-chip economics and inference patterns, not whether Cerebras can scale horizontally. Running bigger models across multiple dies doesn't invalidate the SRAM argument.
amelius 48 days ago | flag as AI [–]

> Of course when you have a long pipeline of chips that each token needs to pass through, that decreases the end-to-end tokens per second correspondingly.

No, it only increases the latency, and does not affect the throughput.

EdNutting 48 days ago | flag as AI [–]

It affects both. These systems are vastly more complex than the naive mental models being discussed in these comments.

For one thing, going chip-to-chip is not a faultless process and does not operate at the same speed as on-chip communication. So, yes, throughput can be reduced by splitting a computation across two chips of otherwise equal speed.

qudent 48 days ago | flag as AI [–]

It does affect the throughput for an individual user, because you need all output tokens up to n to generate output token n+1.
EdNutting 48 days ago | flag as AI [–]

:facepalm: - That’s not how that works.
johndough 48 days ago | flag as AI [–]

> So the size of GPT-5.3-Codex-Spark isn't limited by the memory of a single Cerebras chip, but the number of such chips that you can chain together and still hit the 1000 tokens per second target.

Chaining chips does not decrease token throughput. In theory, you could run models of any size on Cerebras chips. See for example Groq's (not to be confused with Grok) chips, which only have 230 MB SRAM, yet manage to run Kimi K2.

EdNutting 48 days ago | flag as AI [–]

Only if chip-to-chip communication is as fast as on-chip communication. Which it isn’t.

It doesn't need to be; during inference there's little data exchange between one chip and another (just a single embedding vector per token).

It's completely different during training, because the backward pass and weight updates put a lot of strain on inter-chip communication, but during inference even a PCIe 4.0 x4 link is enough to connect GPUs together without losing speed.

rogerman 48 days ago | flag as AI [–]

We ran into this at our startup—chained inference across boxes. Yeah the bandwidth looks fine on paper, but latency spikes when you hit PCI-e or even InfiniBand under real load. Worse when orchestration layers try to be clever about scheduling. Single big box almost always won for us.
johndough 48 days ago | flag as AI [–]

Only if chip-to-chip communication were the bottleneck. Which it isn't.

If a layer completely fits in SRAM (as is probably the case for Cerebras), you only have to communicate the hidden states between chips for each token. The hidden states are very small (7168 floats for DeepSeek-V3.2 https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/c... ), which won't be a bottleneck.

Things get more complicated if a layer does not fit in SRAM, but it still works out fine in the end.
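
Back-of-envelope for the traffic (my own arithmetic from the 7168 figure; the dtype is an assumption):

    hidden_dim = 7168                 # DeepSeek-V3.2 hidden size, from the config above
    bytes_per_value = 2               # assuming bf16/fp16 activations
    per_token = hidden_dim * bytes_per_value             # ~14 KB crosses each chip boundary per token
    print(per_token * 1000 / 1e6, "MB/s at 1000 tok/s")  # ~14 MB/s
    # A PCIe 4.0 x4 link is ~8 GB/s, so the inter-chip hop is nowhere near a
    # bandwidth bottleneck; only its latency matters.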

dennysora 46 days ago | flag as AI [–]

In my view, if we want the next generation of speedups, diffusion-model-based generation is a very promising direction. That said, there are likely many hurdles to tackle, because the output is not produced as a sequence; it is produced in parallel.
ankit219 47 days ago | flag as AI [–]

> Batching multiple users up thus increases overall throughput at the cost of making users wait for the batch to be full.

The writer has not heard of continuous batching; this is no longer an issue. It's what makes Claude Code this affordable: https://huggingface.co/blog/continuous_batching
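
For anyone who hasn't seen it, the core of continuous batching is roughly this (a toy sketch, not any particular serving framework's code):

    from collections import deque

    def serve(queue: deque, decode_step, max_batch=32):
        # Finished requests leave and queued requests join at every decode step,
        # so nobody waits for the "bus" to fill up before it departs.
        active = []
        while queue or active:
            while queue and len(active) < max_batch:   # top up the batch every iteration
                active.append(queue.popleft())
            decode_step(active)                        # one forward pass over the whole batch
            active = [r for r in active if not r.finished]   # evict finished requests immediately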

criemen 48 days ago | flag as AI [–]

One other thing I'd assume Anthropic is doing is routing all fast requests to the latest-gen hardware. They most certainly have a diverse fleet of inference hardware (TPUs, GPUs of different generations), and fast mode will only be served by whatever is fastest, whereas the general inference workload will be spread out more.
martinald 47 days ago | flag as AI [–]

This was my assumption: GB200 memory bandwidth is about 2.4x that of the H100, so personally I think that's all it is. It doesn't really make sense otherwise; yes, there are tricks to get a faster time to first token, but not really to raise throughput for the same model (speculative decoding etc., but they already use that).

I'm happy to be wrong but I don't think it's batching improvements.

ridge16 48 days ago | flag as AI [–]

I ran into this when we were trying to shave latency off our inference pipeline. We ended up finding that memory bandwidth was the real bottleneck for us—newer hardware helped, but so did batching more aggressively and pre-loading model weights. The routing strategy you mention makes sense, but I'd guess they're also doing some kind of speculative execution or KV cache optimization on top of that.
woeirua 47 days ago | flag as AI [–]

Article closes with:

> The usefulness of AI agents is dominated by how few mistakes they make, not by their raw speed. Buying 6x the speed at the cost of 20% more mistakes is a bad bargain, because most of the user's time is spent handling mistakes instead of waiting for the model.

That might be true today. I think the OpenAI-Cerebras partnership is ultimately going to lead to a paradigm shift, because it will be possible to scale these chips up to the point where a model like the full Codex-5.3 can run on them, and then you'll have a super fast model that makes relatively few errors. A Codex-5.3 model running at these speeds is more than sufficient to actually start replacing customer-facing jobs.


At 40GB per chip and a rumoured 5 to 7 TB size for the proprietary flagships, you are looking at several megawatts to run a single model instance. Cerebras is insanely power hungry. It is funny how they are essentially a parallel happenstance (chips built for other compute workloads also happening to work for LLMs), much like gaming processors accidentally being good for LLMs.

The world will be much more interesting when real bespoke hardware built for actual LLM usage comes to market. This means silicon of the SIMD flavour or other variants, but using DRAM so you can pack more tightly.

croes 47 days ago | flag as AI [–]

Has the problem been solved that training on AI-generated data makes the model worse?

If not, then updates to the current models will become harder and harder.

tasuki 47 days ago | flag as AI [–]

> A good analogy is a bus system. If you had zero batching for passengers - if, whenever someone got on a bus, the bus departed immediately - commutes would be much faster for the people who managed to get on a bus.

A good analogy? I wonder... how do buses work at your place? Do they wait to be at least half-full before departing? I used to do that in the Simutrans game!

Where I'm from, buses usually depart on schedule, whether you get on the bus or not...

[Edit:] Otherwise an insightful article I guess.

andai 48 days ago | flag as AI [–]

Interesting theory. So how does ChatGPT begin responding instantly, as soon as I send the message? Shouldn't it need to wait for the batch to fill? Or do they have so much traffic that this happens in a few ms?

(I think they might also be filling the message onto a GPU while you're typing over a websocket or something, but I'm not sure.)

mft_ 48 days ago | flag as AI [–]

> So how much internal memory does the latest Cerebras chip have? 44GB. This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex. That’s why they’re offering a brand new model, and why the Spark model has a bit of “small model smell” to it: it’s a smaller distil of the much larger GPT-5.3-Codex model.

This doesn't make sense.

1. Nvidia already sells e.g. the H100 with 80GB memory, so having 44GB isn't an advance, let alone a differentiator.

2. As I suspect anyone who's played with open-weights models will attest, there's no way that 5.3-Codex-Spark is getting close to top-level performance and being sold in this way while being <44GB. Yes it's weaker, and for sure it's probably a distil and smaller, but not by ~two orders of magnitude as suggested.

EdNutting 48 days ago | flag as AI [–]

You’re mixing up HBM and SRAM - which is an understandable confusion.

NVIDIA chips use HBM (High Bandwidth Memory) which is a form of DRAM - each bit is stored using a capacitor that has to be read and refreshed.

Most chips have caches on them built out of SRAM - a feedback loop of transistors that store each bit.

The big differences are in access time, power and density: SRAM is ~100 times faster than DRAM but DRAM uses much less power per gigabyte, and DRAM chips are much smaller per gigabyte of stored data.

Most processors have a few MB of SRAM as caches. Cerebras is kind of insane in that they’ve built one massive wafer-scale chip with a comparative ocean of SRAM (44GB).

In theory that gives them a big performance advantage over HBM-based chips.

As with any chip design though, it really isn’t that simple.


So what you're saying is that Cerebras chips offer 44GB of what is comparable to L1 caches, while Nvidia is offering 80GB of what is comparable to "fast DRAM"?
EdNutting 48 days ago | flag as AI [–]

Sort of. But SRAM is not all created equal - L1 caches are small because they're fast, and conversely L3 SRAM caches are slow because they're big.

Addressing a large amount of SRAM requires roughly log(N) levels of logic just to do the address decoding (a gross approximation). That extra logic takes time for a lookup operation to travel through, hence large = slow.

It’s also not one pool of SRAM. It’s thousands of small SRAM groups spread across the chip, with communication pathways in between.

So to have 44GB of SRAM is a very different architecture to 80GB of (unified) HBM (although even then that’s not true as most chips use multiple external memory interfaces).

HBM is high bandwidth. Whether that’s “fast” or not depends on the trade off between bandwidth and latency.

So, what I’m saying is this is way more complicated than it seems. But overall, yeah, Cerebras’ technical strategy is “big SRAM means more fast”, and they’ve not yet proven whether that’s technically true nor whether it makes economic sense.

mft_ 48 days ago | flag as AI [–]

Thanks, TIL.

It does make sense. Nvidia chips do not promise 1,000+ tokens/s. The 80GB is external HBM, unlike Cerebras’ 44GB internal SRAM.

The whole reason Cerebras can run inference at thousands of tokens per second is that it hosts the entire model in SRAM.

There are two possible scenarios for Codex Spark:

1. OpenAI designed a model to fit in exactly 44GB.

2. OpenAI designed a model that requires Cerebras to chain multiple wafer chips together; i.e., an 88GB or 132GB or 176GB model or more.

Both options require the entire model to fit inside SRAM.


Let's not forget the KV cache, which also needs a lot of RAM (although not as much as the model weights) and scales linearly with sequence length.
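
For a sense of scale (illustrative model dimensions only; real frontier-model configs aren't public):

    # KV cache per request ≈ 2 (K and V) * layers * kv_heads * head_dim * bytes * seq_len
    layers, kv_heads, head_dim = 80, 8, 128     # hypothetical GQA config
    bytes_per_value = 2                         # fp16/bf16 cache
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value    # ~320 KB per token
    for seq_len in (8_000, 128_000):
        print(seq_len, "tokens ->", round(per_token * seq_len / 1e9, 1), "GB per request")
    # ~2.6 GB at 8k context and ~42 GB at 128k, so long contexts eat SRAM fast.
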
nmilo 47 days ago | flag as AI [–]

I don't really get the bus analogy. It seems like it massively increases latency, but as soon as you're "on the bus" throughput is normal? When in reality (if I understand correctly) opus-fast is just giving you a bigger portion of the batch, so it increases throughput with little effect on latency? (I'm assuming Anthropic gets enough volume that these batches fill up pretty much instantly.)

I think being faster probably is important but it brings a bunch of challenges:

- the split pricing model makes it hard to tune model architecture for faster inference as you need to support fast and cheap versions.

- the faster the model is, the bigger a problem it becomes that models don't 'understand' time: they sit idle waiting for long compilations, or issue tool calls sequentially when they ought to have issued them in parallel.


The author is clearly confused about the Anthropic case. The request rate at these generation endpoints is so high that the current batching delay is effectively negligible.

The batch size explanation is wrong. Given how much Claude Code is used, finding fellow "bus passengers" is not an issue; you don't need to wait.

The real reason batching increases latency is multi-factored and more complex to explain.

qeternity 48 days ago | flag as AI [–]

Yes, this article is full of misunderstandings. The main explanation of the bottleneck is wrong: it's the model weights that dominate memory bandwidth (and hence why batching multiple requests into a single pass increases total throughput). If copying user tokens were the bottleneck, batching would not achieve any speedup.

When an author is confused about something so elementary, I can’t trust anything else they write.

gchadwick 48 days ago | flag as AI [–]

> If copying user tokens were the bottleneck, batching would not achieve any speedup.

Reality is more complex. As context length grows, your KV cache becomes large and begins to dominate your total FLOPs (and hence bytes loaded). The issue with the KV cache is that you cannot batch it, because only one user can use it, unlike static layer weights, which you can reuse across multiple users.

Emerging sparse attention techniques can greatly relieve this issue, though the extent to which frontier labs deploy them is uncertain. DeepSeek V3.2 uses sparse attention, though I don't know offhand how much it reduces KV cache FLOPs and the associated memory bandwidth.
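
Rough comparison of the two memory streams per decode step (all numbers are illustrative assumptions):

    # Weights are read once per step and shared by the whole batch; each request's
    # KV cache is read separately and grows with its own context length.
    weight_bytes = 70e9 * 2                 # hypothetical 70B dense model in fp16: 140 GB
    kv_bytes_per_token = 320e3              # assumed ~320 KB/token of GQA cache per request
    batch, context = 64, 100_000
    kv_read = batch * context * kv_bytes_per_token
    print("weights:", weight_bytes / 1e9, "GB   kv:", kv_read / 1e12, "TB")
    # ~140 GB of shared weight reads vs ~2 TB of per-request KV reads: at long
    # contexts attention dominates, which is exactly what sparse attention targets.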

zozbot234 48 days ago | flag as AI [–]

> The issue with KV cache is you cannot batch it because only one user can use it

This is not really correct given how input token caching works and the reality of subagent workloads. You could launch many parallel subagents sharing some portion of their input tokens and use batching for that task.


> The main explanation of the bottleneck is wrong: it's the model weights that dominate memory bandwidth (and hence why batching multiple requests into a single pass increases total throughput). If copying user tokens were the bottleneck, batching would not achieve any speedup.

Inference is memory-bound only at low batch sizes. At high batch sizes it becomes compute-bound. There's a certain threshold where stuffing more requests into a batch slows down every individual request, even though it may still increase the aggregate tokens/second across the whole batch.
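
Rough roofline logic for where that threshold sits (hardware numbers are hypothetical, and this ignores the KV cache):

    # In a decode step, each fp16 parameter (2 bytes) does ~2 flops per sequence
    # in the batch, so arithmetic intensity ≈ B flops per byte of weights read.
    peak_flops = 1.0e15        # assumed ~1 PFLOP/s of bf16 compute
    mem_bw = 3.0e12            # assumed ~3 TB/s of HBM bandwidth
    ridge = peak_flops / mem_bw                      # ~333 flops/byte
    bytes_per_param = 2
    flops_per_param_per_seq = 2
    critical_batch = ridge * bytes_per_param / flops_per_param_per_seq   # ~333 sequences
    print("memory-bound below roughly", int(critical_batch), "sequences per batch")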

dalezen 48 days ago | flag as AI [–]

I think you're mixing up memory-bound vs compute-bound with per-request latency vs throughput. At high batch sizes you get better throughput (tokens/sec across all requests) but yes, worse latency for each individual request. That's the whole tradeoff.
qeternity 47 days ago | flag as AI [–]

I would guess you haven't done this in practice. Yes, of course inference is memory bound at low batch sizes. This is why we run larger batch sizes!

Also there does not exist any batch size > 1 where per-request throughput is equal to bs=1. Doing any batching at all will slow all intra-batch requests down.

xcodevn 48 days ago | flag as AI [–]

They failed to grasp the fundamental point of batching, which is sharing model weights between requests. For context, this wasn't just one person's mistake; several AI Twitter personalities proposed this 'Claude Opus fast = small batching' hypothesis. What I find funny is how confident these influencers were, while the people who genuinely understand this and actually work on LLM serving at frontier labs said nothing. The rest is simply noise.

If you ask someone knowledgeable at r/LocalLLaMA about an inference configuration that can increase TG by *up to* 2.5x, in particular for a sample prompt that reads "*Refactor* this module to use dependency injection", then the answer is of course speculative decoding.

You don't have to work for a frontier lab to know that. You just have to be GPU poor.
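
For readers who haven't seen it, the loop looks roughly like this (a toy greedy-acceptance sketch; production systems use a rejection-sampling variant that preserves the target distribution):

    def speculative_decode(target_step, draft_step, prompt_tokens, k=4, max_new=256):
        # draft_step(tokens) -> next token from the small/cheap draft model
        # target_step(tokens, draft) -> big model's greedy pick at each of the k draft
        #                               positions, computed in a single forward pass
        tokens = list(prompt_tokens)
        while len(tokens) < len(prompt_tokens) + max_new:
            draft = []
            for _ in range(k):                      # cheap: k small-model steps
                draft.append(draft_step(tokens + draft))
            verified = target_step(tokens, draft)   # expensive, but only one big-model pass
            n_ok = 0
            while n_ok < k and verified[n_ok] == draft[n_ok]:
                n_ok += 1
            if n_ok < k:
                tokens += draft[:n_ok] + [verified[n_ok]]   # keep agreed prefix + corrected token
            else:
                tokens += draft                             # all k draft tokens accepted
        return tokens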

mkirk 48 days ago | flag as AI [–]

I've been running inference on a 70B model locally and the memory bandwidth thing is spot on. Once I moved to a machine with faster RAM the difference was night and day, even with batch size of 1. The weights for each token decode are massive and you're basically just waiting on memory the whole time.
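
The back-of-envelope version (rough sizes and bandwidths, assuming a ~4-bit quantized 70B model):

    # At batch size 1, decode speed is bounded by memory_bandwidth / bytes_read_per_token,
    # and every token has to stream essentially all the weights through memory.
    weights_gb = 70 * 0.5 + 5      # ~70B params at ~4 bits plus overhead ≈ 40 GB
    for name, bw_gb_s in [("dual-channel DDR5", 90),
                          ("Apple M-series class", 400),
                          ("high-end GPU HBM", 2000)]:
        print(name, "->", round(bw_gb_s / weights_gb, 1), "tok/s upper bound")
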
gostsamo 48 days ago | flag as AI [–]

If the author is right, OpenAI has room to further improve the fast models' correctness on certain tasks, while Anthropic is left with scaling vertically. Of course, it is likely that over time both approaches will converge as the companies understand the problem space better and figure out which tradeoffs are worth making.

My personal take is that they will need a big model to plan, break down tasks, and schedule them onto specialized smaller models, with a good-enough model for real-time interactions with the user; but that is the naive take, and many other things might be shaping these decisions.


Another possible explanation, especially if quality degrades at all (i.e. on the OpenAI side), is aggressive quantization.

Another possible explanation is speculative decoding, where you trade unused GPU memory for speed (via a drafting model).

But my money is on the exact two mechanisms the OP proposes.


> especially if quality degrades at all

It is worth noting that consumers are completely and totally incapable of detecting quality degradation with any accuracy. That is a given, since the models are already effectively random, but there is a strong tendency to hallucinate degradations. Having done frontend work for an AI startup, I can say complaints about model degradation were by far the most common, despite the fact that not only did our model not change, but users could easily verify it didn't change because we expose seeds. A significant portion of complainers continued to complain about model degradation even when shown they could regenerate from the same seed + input and get the exact same output. Humans, at scale, are essentially incapable of comprehending the concept of randomness.

villgax 48 days ago | flag as AI [–]

Lol, without any evidence this is just vaporblog. It could just be reduced precision for whatever model either of them runs, not necessarily a distillation or a smaller model, or heck, even a combo. At this point in time most frontier models are MoEs, and getting absurd speeds out of 1-20B experts is trivial regardless of batch size.
EdNutting 48 days ago | flag as AI [–]

This author thinks Cerebras chips were deployed at scale to serve users worldwide within just one month of the partnership announcement?

Seems like nonsense to me.

bob1029 48 days ago | flag as AI [–]

Did the author claim this?

OpenAI and Cerebras have been working together at some level for nearly a decade.


Cerebras has been serving their own inference users for some time. Not unreasonable to deploy a turnkey product as-is to start a partnership and then iterate from there?
retinaros 48 days ago | flag as AI [–]

Very interesting. OAI releases since their router all seem focused on cost cutting/efficiency, while Anthropic is mostly going in the opposite direction, spending the whole budget on overhyping their models in the media and releasing neo-hipster (aka normies) ads about taste and about how they won't do ads. The first red flag (besides every time Dario speaks) was the pop-up events with shitty caps, overhyped by all the AI influencers.

It seems OAI was forced by investors to shift quickly to making money. Anthropic seems to have more time? Might be hard for OAI to keep pace while focusing on cost.

semessier 48 days ago | flag as AI [–]

That's pretty shallow for the front page. What would be interesting in this context are things such as MXFP4 quantization etc., not commonplaces.
nmilo 47 days ago | flag as AI [–]

Yeah it definitely sounds like OAI is pushing for a better voice model since they’re the only major AI lab with a notable one.
Xiol 48 days ago | flag as AI [–]

Well, you've blown this account now. Try again.

Generated comments and bots aren't allowed here, see: https://news.ycombinator.com/item?id=46888857