Introspective Diffusion Language Models

thepasch · 84 days ago

If I’m reading this right, this is pretty wild. They turned a Qwen autoregressor into a diffuser by using a bunch of really clever techniques, and they vastly outperform any “native diffuser,” actually being competitive with the base model they were trained from. The obvious upside here is the massive speedup in generation.

And then through a LoRA adapter, you can ground the diffuser on the base model’s distribution (essentially have it “compare” its proposals against what the base model would’ve generated), which effectively means: exact same byte-for-byte output for the same seed, just roughly twice as fast (which should improve even more for batched tasks).

I’m not an expert, more of a “practicing enthusiast,” so I might be missing something, but at first glance, this reads super exciting to me.
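For anyone who wants the "compare its proposals against what the base model would've generated" idea in concrete form, here is a minimal sketch of greedy propose-and-verify decoding, the scheme used in speculative decoding. It is not the paper's implementation: `toy_next` and `toy_draft` are made-up stand-ins for the base model and the diffuser, and a real system would verify a whole block in one batched forward pass rather than one call per token. The point is only that the accepted output matches plain base-model decoding token for token, while needing fewer rounds.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]              # base model: greedy next token
DraftFn = Callable[[List[Token], int], List[Token]]  # drafter: proposes k tokens


def base_decode(next_token: NextFn, prompt: List[Token], n: int) -> List[Token]:
    """Plain one-token-at-a-time greedy decoding with the base model."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prompt):]


def verified_decode(next_token: NextFn, draft_block: DraftFn,
                    prompt: List[Token], n: int, block: int = 4
                    ) -> Tuple[List[Token], int]:
    """Accept drafted tokens while they match the base model's choice.

    On the first mismatch the base model's token is kept and the rest of
    the proposal is discarded, so every round emits at least one token.
    Returns the generated tokens and the number of rounds used.
    """
    out = list(prompt)
    target = len(prompt) + n
    rounds = 0
    while len(out) < target:
        rounds += 1
        proposal = draft_block(out, min(block, target - len(out)))
        if not proposal:                  # drafter gave nothing: fall back
            out.append(next_token(out))
            continue
        for tok in proposal:
            expected = next_token(out)    # what the base model would emit
            out.append(expected)
            if tok != expected:           # reject the remainder of the block
                break
    return out[len(prompt):], rounds


def toy_next(ctx: List[Token]) -> Token:
    """Hypothetical deterministic 'base model' over a 7-token vocabulary."""
    return (ctx[-1] * 3 + len(ctx)) % 7


def toy_draft(ctx: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: right most of the time, wrong on purpose sometimes."""
    tmp, proposal = list(ctx), []
    for _ in range(k):
        tok = toy_next(tmp)
        if len(tmp) % 5 == 0:             # deliberate error to exercise rejection
            tok = (tok + 1) % 7
        proposal.append(tok)
        tmp.append(tok)
    return proposal
```

Calling `verified_decode(toy_next, toy_draft, [1], 12)` should return the same tokens as `base_decode(toy_next, [1], 12)`, just in fewer rounds than twelve single-token steps.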

xiphias2 · 83 days ago

How does this compare to DFlash?

https://z-lab.ai/projects/dflash/

And DDTree?

https://liranringel.github.io/ddtree/

andsoitis · 84 days ago

Is anyone here experimenting seriously with Diffusion for text generation? I’d love to learn about your experiences!

Topfi · 84 days ago

I've found the latency and pricing make Mercury 2 extremely compelling for some UX experiments focused on automated note tagging/interlinking. Far more than the Gemini Flash Lite I used before, it made some interactions nearly frictionless, very close to how old-school autocomplete/T9/autocorrect works: users don't even think about the processes behind it.

Sadly, it does not perform at the level of e.g. Haiku 3.5 for tool calling, despite their own benchmarks claiming parity with Haiku 4.5, but it does compete with Flash Lite there too.

Anything with very targeted output, sufficient existing input and that benefits from a seamless feeling lends itself to dLLMs. Could see a place in tab-complete too, though Cursor's model seems to be sufficiently low latency already.

grantger · 84 days ago

The autocomplete comparison is doing a lot of work here. Autocomplete is stateless and local. Diffusion models are still burning cloud compute on every token. Until inference cost drops another order of magnitude, calling it "frictionless like T9" seems more like a UX illusion than a real architectural similarity.

LoganDark · 84 days ago

I've been playing with a Swift implementation of a diffusion language model (WeDLM), but performance is not yet acceptable and it still generates roughly from left-to-right like a language model (just within a sliding window rather than strictly token-by-token... but that doesn't matter when the sliding window is only like 16 tokens.)

simianwords · 84 days ago

Can diffusion models have reasoning steps where they generate a block, introspect and then generate another until the output is satisfactory?

moeadham · 84 days ago

Well, you can take the output of a first pass and pass it back through the model like AR “reasoning” models do at inference time.

krh77 · 84 days ago

But doesn't that just push the question back? The diffusion model needs to "know" whether its first pass is good enough to stop — otherwise you're just looping forever. Has anyone actually measured whether these models have that kind of calibrated self-assessment?

hsj47 · 84 days ago

The denoising process is already iterative, but semantic self-evaluation -- does this block cohere? -- is a different operation entirely. You'd need some verifier signal embedded in the loop. Whether diffusion models can learn that implicitly during training, rather than requiring an explicit critic, is genuinely unclear to me.
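To make the "verifier signal embedded in the loop" idea concrete, here is a toy sketch of a bounded refine-and-check loop. Everything in it is hypothetical: `refine` stands in for another denoising pass over a block, `score` for whatever critic or self-assessment signal you trust, and the threshold and pass budget are arbitrary. It does not answer whether the score is actually calibrated; it only shows that a pass budget is what keeps you from looping forever.

```python
from typing import Callable, Tuple


def refine_until_ok(refine: Callable[[str], str],
                    score: Callable[[str], float],
                    draft: str,
                    threshold: float,
                    max_passes: int) -> Tuple[str, int, bool]:
    """Refine a block until the verifier is satisfied or the budget runs out.

    Returns (text, refinement passes used, whether the threshold was met).
    """
    current = draft
    for used in range(max_passes + 1):
        if score(current) >= threshold:
            return current, used, True
        if used < max_passes:
            current = refine(current)
    return current, max_passes, False


def toy_refine(text: str) -> str:
    """Hypothetical 'denoising pass': fills in one masked position ('?')."""
    return text.replace("?", "a", 1)


def toy_score(text: str) -> float:
    """Hypothetical verifier: fraction of positions that are no longer masked."""
    if not text:
        return 0.0
    return sum(ch != "?" for ch in text) / len(text)
```

With a generous budget, `refine_until_ok(toy_refine, toy_score, "??b?", 1.0, 8)` stops as soon as nothing is masked; with `max_passes=2` it gives up and reports that the threshold was not met.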

mlmonkey · 83 days ago

I'm no expert (just a monkey... ;) ), but isn't Diffusion supposed to generate ALL of the output at once? From their diagram, it looks like their I-DLM model uses previously generated context to generate the next tokens (or blocks).

fhess · 83 days ago

Right, so it's block autoregressive with extra steps. Curious what actual throughput looks like vs vanilla AR when you're serving a few hundred concurrent users at 3am and your GPU memory is already sweating.

sdenton4 · 83 days ago

Block autoregressive generation can give you big speedups.

Consider that outputting two tokens at a time will be a (2-epsilon)x speedup over running one token at a time. As your block size increases, you quickly get to fast enough that it doesn't matter sooooo much whether you're doing blocks or actual all-at-once generation. What matters, then, is the quality trade-off for moving to block-mode output. And here it sounds like they've minimized that trade-off.
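A back-of-the-envelope version of that arithmetic, under an assumed cost model that is mine and not from the paper: a step that emits `k` tokens costs `1 + overhead * (k - 1)` times as much as a single-token step, where `overhead` plays the role of the epsilon above. The ideal speedup is then `k / (1 + overhead * (k - 1))`, which ignores any tokens that get rejected or redone.

```python
def block_speedup(k: int, overhead: float) -> float:
    """Ideal speedup of k-token blocks over one-token steps.

    Assumes (hypothetically) that each extra token in a block adds a
    fraction `overhead` of a single-token step's cost.
    """
    if k < 1:
        raise ValueError("block size must be at least 1")
    if overhead < 0:
        raise ValueError("overhead must be non-negative")
    return k / (1 + overhead * (k - 1))


if __name__ == "__main__":
    # Print the ideal speedup for a few block sizes at 5% per-token overhead.
    for k in (1, 2, 4, 8, 16, 32):
        print(k, round(block_speedup(k, overhead=0.05), 2))
```

Under this model the speedup for `k = 2` sits just below 2x, and for large blocks it flattens out toward `1 / overhead`, which is one way to read "fast enough that it doesn't matter": past a certain block size the remaining gain is small compared to whatever quality you give up.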

ilaksh · 83 days ago

Does this mean I should switch to sglang? How hard is it to add the capability for this type of model to vLLM? Or does it already handle them?

shepardrtc · 82 days ago

Last year, there was a period of a week or two where I would see Gemini responses diffusing in. I don't know if they were experimenting with it, or if it was just an effect. It didn't last long, but it was interesting to see.

ramon156 · 84 days ago

> 2025-04-12: Initial code release with training and inference support.

> 2025-04-12: Released I-DLM-8B, I-DLM-32B, and I-DLM-8B-LoRA on HuggingFace.

Is this old already? Not saying that's a bad thing, since it seems very sophisticated. Just curious if there's an update.

oersted · 84 days ago

It's clearly a typo in the year: April 12 was two days ago, and a quick check on HuggingFace shows that they were uploaded 5 days ago.

scotty79 · 84 days ago

So can you just use this and have a faster Qwen32b?

https://huggingface.co/yifanyu/I-DLM-32B/tree/main

2001zhaozhao · 83 days ago

I always thought some kind of block-based diffusion architecture would be the future of LLMs, especially some architecture that can dynamically alter its token generation rate as well as "reason and generate at the same time", and have an opportunity to correct tokens that it has just generated. Something like the equivalent of a short term "working memory" for humans. But I have no understanding of the math. Fingers crossed.

keyle · 83 days ago

This looks great. Can we use it yet?

Openpic · 83 days ago

[Translated from Japanese] I hear it improved by 3x, but has the bottleneck shifted from Memory Bandwidth to Compute? Or is Memory Bandwidth still dominant?

salviati · 83 days ago

This translates to

> I understand it improved by 3x, but has the bottleneck shifted from Memory Bandwidth to Compute? Or is Memory Bandwidth still dominant?

But why did you post your comment in Japanese? We have so many good options for automated translation nowadays!

でも、なぜ日本語でコメントを投稿したんですか？最近は自動翻訳の良い選択肢がたくさんあるのに！

fumblebee · 83 days ago

I'm not in on the joke, can someone ELI5