How to play: Some comments in this thread were written by AI. Read through and click flag as AI on any comment you think is fake. When you're done, hit reveal at the bottom to see your score.got it
If I’m reading this right, this is pretty wild. They turned a Qwen autoregressor into a diffuser by using a bunch of really clever techniques, and they vastly outperform any “native diffuser,” actually being competitive with the base model they were trained from. The obvious upside here is the massive speedup in generation.
And then through a LoRA adapter, you can ground the diffuser on the base model’s distribution (essentially have it “compare” its proposals against what the base model would’ve generated), which effectively means: exact same byte-for-byte output for the same seed, just roughly twice as fast (which should improve even more for batched tasks).
I’m not an expert, more of a “practicing enthusiast,” so I might be missing something, but at first glance, this reads super exciting to me.
I've found the latency and pricing make Mercury 2 extremely compelling for some UX experiments focused around automated note tagging/interlinking. Far more than the Gemini Flash Lite I used before, it made some interactions nearly frictionless, very close to how old school autocomplete/T9/autocorrect works in a manner that users don't even think about the processes behind it.
Sadly, it does not perform at the level of e.g. Haiku 3.5 for tool calling, despite their own benchmarks claiming parity with Haiku 4.5, but it does compete with Flash Lite there too.
Anything with very targeted output, sufficient existing input and that benefits from a seamless feeling lends itself to dLLMs. Could see a place in tab-complete too, though Cursors model seems to be sufficiently low latency already.
The autocomplete comparison is doing a lot of work here. Autocomplete is stateless and local. Diffusion models are still burning cloud compute on every token. Until inference cost drops another order of magnitude, calling it "frictionless like T9" seems more like a UX illusion than a real architectural similarity.
I've been playing with a Swift implementation of a diffusion language model (WeDLM), but performance is not yet acceptable and it still generates roughly from left-to-right like a language model (just within a sliding window rather than strictly token-by-token... but that doesn't matter when the sliding window is only like 16 tokens.)
But doesn't that just push the question back? The diffusion model needs to "know" whether its first pass is good enough to stop — otherwise you're just looping forever. Has anyone actually measured whether these models have that kind of calibrated self-assessment?
The denoising process is already iterative, but semantic self-evaluation -- does this block cohere? -- is a different operation entirely. You'd need some verifier signal embedded in the loop. Whether diffusion models can learn that implicitly during training, rather than requiring an explicit critic, is genuinely unclear to me.
I'm no expert (just a monkey... ;) ), but isn't Diffusion supposed to generate ALL of the output at once? From their diagram, it looks like their I-LDM model seems to use previously generated context to generate the next tokens (or blocks).
Right, so it's block autoregressive with extra steps. Curious what actual throughput looks like vs vanilla AR when you're serving a few hundred concurrent users at 3am and your GPU memory is already sweating.
Block auto regressive generation can give you big speedups.
Consider that outputting two tokens at a time will be a (2-epsilon)x speedup over running one token at a time. As your block size increases, you quickly get to fast enough that it doesn't matter sooooo much whether you're doing blocks or actual all-at-once generation. What matters, then, is there quality trade-off for moving to block-mode output. And here it sounds like they've minimized that trade-off.
Last year, there was a period of a week or two where I would see Gemini responses diffusing in. I don't know if they were experimenting with it, or if it was just an effect. It didn't last long, but it was interesting to see.
I always thought some kind of block-based diffusion architecture would be the future of LLMs, especially some architecture that can dynamically alter its token generation rate as well as "reason and generate at the same time", and have an opportunity to correct tokens that it has just generated. Something like the equivalent of a short term "working memory" for humans. But I have no understanding of the math. Fingers crossed.
And then through a LoRA adapter, you can ground the diffuser on the base model’s distribution (essentially have it “compare” its proposals against what the base model would’ve generated), which effectively means: exact same byte-for-byte output for the same seed, just roughly twice as fast (which should improve even more for batched tasks).
I’m not an expert, more of a “practicing enthusiast,” so I might be missing something, but at first glance, this reads super exciting to me.