Suggestion for the maintainers: the comparison table currently lists some pretty old models: Qwen 2.5 14B, Mixtral 8x7B, and Llama 3.3 70B.
A lot of people are reporting incredible results with the Qwen 3.5 MoE models on Apple hardware right now (streaming experts - see https://simonwillison.net/2026/Mar/24/streaming-experts/) - it would be great to get some of those models into that table.
Thanks for sharing this! If you'd be interested in running the benchmark yourself with Hypura, I'd happily merge the results into our stats. Otherwise I'll add it to my todo list :)
Simon, a little off-topic, but it seems that your website isn't working.
> An error occurred in the application and your page could not be served. If you are the application owner, check your logs for details. You can do this from the Heroku CLI with the command
(I checked your website because I wanted to see if you had written something about trivy/litellm as well. I highly recommend checking out what has happened in the litellm space if possible, as I would love to read your thoughts on it.)
Have a nice day, Simon!
Edit: the website works now, but I'm not sure what had gone wrong previously (an issue on Heroku's end, maybe?).
Edit 2: now that the website is working, I can see that you have already made a post about it.
For a lot of local workloads, sub-1 tok/s is useless in the foreground and perfectly acceptable in the background. If the choice is “this crashes” vs “this finishes overnight,” that's still a meaningful capability jump.
the practical question is whether the read pattern is sequential enough to actually saturate nvme bandwidth, or whether the attention layer access pattern ends up being random enough to kill throughput. sequential reads on a decent nvme get you 5-7 GB/s; random reads drop to maybe 500 MB/s depending on queue depth.
for a 1T model you'd need to stream something like 2TB of weights per forward pass at fp16. even at peak sequential that's 300+ seconds per token, which is... not great for interactive use but maybe fine for batch inference where you don't care about latency.
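rough back-of-envelope for those numbers (assumptions: dense 1T params at fp16, ~7 GB/s peak sequential reads, no MoE sparsity):

    params = 1e12                     # 1T parameters (dense assumption)
    weight_bytes = params * 2         # fp16 = 2 bytes/param, ~2 TB total
    nvme_seq = 7e9                    # ~7 GB/s peak sequential NVMe
    print(weight_bytes / nvme_seq)    # ~285 seconds per token if every weight is streamed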
still a cool proof of concept though. the gap between 'can run' and 'runs usefully' is where things get interesting.
Yes, definitely agree. It's more of a POC than a functional use case. However, for many smaller MoE models this method can actually be useful and capable of achieving multiple tokens/sec.
> for a 1T model you'd need to stream something like 2TB of weights per forward pass
Isn't this missing the point of MoE models completely? MoE inference is sparse: you only read a small fraction of the weights per layer. You still have the problem of each individual expert-layer being quite small (a few MiB each, give or take), but those reads are large enough for the NVMe.
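To put rough numbers on it (purely illustrative; the active-parameter count is an assumption, not the spec of any particular model):

    total_params  = 1e12                    # hypothetical 1T-param MoE
    active_params = 32e9                    # hypothetical ~32B params activated per token
    per_token_bytes = active_params * 2     # fp16: ~64 GB actually read per token
    print(per_token_bytes / 7e9)            # ~9 s/token at 7 GB/s sequential
    print(active_params / total_params)     # ~3% of the weights touched per token

Still slow, but well over an order of magnitude less data than the dense estimate, and a resident expert cache cuts it further.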
The sparse read pattern sounds clean on paper. Wait until your expert routing starts hitting the same few experts under load and you're back to sequential hotspots on one nvme while the others idle.
ran this on an M1 Pro last week and the access pattern is definitely more random than sequential in practice. queue depth stays low since there's no aggressive prefetch happening. curious what the 4K random numbers actually look like at QD1 on Apple silicon.
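in case it's useful, this is roughly how I'd probe QD1 4K random reads in Python (the path and iteration count are placeholders; F_NOCACHE is the macOS way to bypass the page cache, guarded with getattr since the constant isn't always exposed — without it you're measuring cached reads):

    import fcntl, os, random, time

    path = "/tmp/testfile.bin"            # pre-create a multi-GB file first
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.fcntl(fd, getattr(fcntl, "F_NOCACHE", 48), 1)   # 48 is F_NOCACHE on Darwin
    except OSError:
        pass
    size = os.fstat(fd).st_size
    block, n = 4096, 10_000
    t0 = time.perf_counter()
    for _ in range(n):
        off = random.randrange(0, size - block) & ~(block - 1)   # 4K-aligned random offset
        os.pread(fd, block, off)                                 # single outstanding read = QD1
    dt = time.perf_counter() - t0
    print(f"{n * block / dt / 1e6:.1f} MB/s, {dt / n * 1e6:.0f} us/read")
    os.close(fd)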
I'm referencing it as being possible; however, I didn't share benchmarks because, candidly, the performance would be so slow it would only be useful for very specific tasks over long time horizons. The more practical use cases are less flashy but capable of achieving multiple tokens/sec (i.e. smaller MoE models where not all experts need to be loaded in memory simultaneously).
The MoE point matters here: sparse activation means you're not reading all 2TB per forward pass, but the access pattern flips from sequential to random, which is exactly the worst case for NVMe. Been thinking about this a lot for agent inference workloads where you want consistent latency more than peak throughput.
Still have 4 brand new ones in my storage unit. Just in case of moments like this.
Joke aside (I do have them tho!), I don't think Optane is that much use (not to mention it is only 256GiB for my unit). It is a useful legacy crutch if you have legacy software that is not designed to issue multiple reads / writes in parallel. If you do, it is really not faster than NVMe, especially these modern ones.
It's not about being faster (except for small reads where latency dominates, which is actually relevant when reading a handful of expert-layers immediately after routing). It's the wear-out resistance, which opens up the possibility of storing the KV-cache (including the "linear" KV-cache of recent Qwen, which is not append-only as it was with the pure attention models) and maybe even per-layer activations - though this has the least use given how ephemeral these are.
Yes, their NAND division has been sold; it is now mostly under Solidigm. Maybe Solidigm could bring it back, but it seems unlikely (given the previous commercial failure).
> Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.
macOS doesn't have an "OOM killer" in that sense. (It has an out of swap space killer but it's pretty weak.)
So what will happen is that either the memory wiring will fail, or else the machine will get really slow and panic.
This is a pretty cool project! Essentially this is like using swap to extend your RAM, but in a 'smart' way so you don't overload the NVMe unnecessarily.
I do wonder in practice how the 'smarts' pan out, because putting a ton of stress on your NVMe during generation is probably not the best choice for its longevity.
Right, read workloads are basically free from a wear standpoint. The concern is more about latency — NVMe reads add up when you're loading weights layer by layer, and that can noticeably slow generation versus keeping everything hot in RAM.
> but in a 'smart' way so you don't overload the NVMe unnecessarily
"overloading NVMe"? What is that about? First time I've heard anything about it.
> because putting a ton of stress on your NVMe during generation
This really shouldn't "stress your NVMe"; something is severely wrong if that's happening. I've been hammering my SSDs forever, and while write operations "hurt" the longevity of the flash cells themselves, the controller interface really shouldn't be affected by this at all, unless I'm missing something here.
Even if there were a ton of writing, I'm not sure where NVMe even comes into the picture; write durability is about the flash cells on SSDs, nothing to do with the interface. Someone correct me if I'm wrong.
Same pattern as the old SGI NUMA systems circa 1995 — keep compute local, stream data in from slower tiers. Nothing new, but nice to see it done cleanly on consumer hardware.
There needs to be something like this from Ollama. At the moment Ollama has a lot of flaws that prevent it from getting great performance (my understanding is that it needs better GPU/CPU splits, etc). But Ollama is the only way to host an LLM and have it switch out on demand. Sigh.
Ollama has very substandard support for mmap at present, which hurts inference with larger models. There are some recent pull requests in flight that should help address this to at least some extent (https://github.com/ollama/ollama/pull/14525, https://github.com/ollama/ollama/pull/14134, https://github.com/ollama/ollama/pull/14864), but progress seems to be stalling. Their support for recent Qwen models also seems to have some bespoke incompatibilities with llama.cpp, which doesn't help matters; it's difficult to test the same model with both.
You do not provide any comparison to llama.cpp with mmap.
You do not explain how any kind of predictor can work for MoE experts.
You do not explain how prediction can even be useful. I can predict the layers used in a dense model (all of them are used in order), but that doesn't help me much. It's still bottlenecked on bandwidth (hint: MoE doesn't change this).
OS paging would be significantly worse here. The kernel's page fault handler is reactive — it doesn't know you're about to read layer 47's FFN weights, so it can't prefetch. You stall on every fault, wait for the 4KB/16KB page to load, then resume. With 80 layers of dense FFN streaming, that's thousands of cold faults per token.
What makes this approach faster is that the model's access pattern is completely deterministic during inference. You know exactly which tensors are needed next because transformer layers execute sequentially. So you can issue large sequential reads and prefetch the next layer while the current one is computing on Metal. The OS page cache can't do that — it has no concept of "layer N+1 comes after layer N."
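A minimal sketch of that double-buffering idea (not the repo's actual implementation; f is an open binary file of packed layer weights, and compute, layer_offsets, and layer_sizes are placeholders):

    import threading

    def stream_layers(f, layer_offsets, layer_sizes, compute):
        def read_layer(i):
            f.seek(layer_offsets[i])
            return f.read(layer_sizes[i])          # one large sequential read

        nxt = read_layer(0)
        for i in range(len(layer_offsets)):
            cur, nxt = nxt, None
            if i + 1 < len(layer_offsets):
                box = {}
                t = threading.Thread(target=lambda j=i + 1: box.setdefault("w", read_layer(j)))
                t.start()                          # fetch layer i+1 in the background
            compute(i, cur)                        # e.g. dequantize + matmul on the GPU
            if i + 1 < len(layer_offsets):
                t.join()
                nxt = box["w"]

The point is simply that the read for layer N+1 overlaps with the compute for layer N, which the fault-driven path can't arrange.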
For MoE it's even more stark. The OS would page in all 8 experts on the first token that routes to each one, then evict them under memory pressure with LRU, which has no idea that expert 3 fires 10x more often than expert 7. The neuron cache here is basically a domain-specific replacement policy.
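A toy version of what a frequency-aware expert cache looks like versus plain LRU (illustrative only; load_expert is a stand-in for the NVMe read):

    from collections import Counter

    class ExpertCache:
        def __init__(self, capacity, load_expert):
            self.capacity, self.load_expert = capacity, load_expert
            self.cache, self.hits = {}, Counter()

        def get(self, expert_id):
            self.hits[expert_id] += 1
            if expert_id not in self.cache:
                if len(self.cache) >= self.capacity:
                    # evict the expert with the fewest routing hits, not the least recent one
                    coldest = min(self.cache, key=lambda e: self.hits[e])
                    del self.cache[coldest]
                self.cache[expert_id] = self.load_expert(expert_id)   # NVMe read
            return self.cache[expert_id]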
Has anyone benchmarked madvise MADV_SEQUENTIAL against explicit async prefetch for this workload? My intuition is the granularity difference matters — madvise still operates on pages, not weight tensors, and the readahead heuristics might fight you on irregular access patterns.
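For reference, the two sides of that comparison look roughly like this in Python (a sketch; the file name, offsets, and sizes are made up, and it assumes the file is larger than the ranges touched):

    import mmap, os

    fd = os.open("weights.bin", os.O_RDONLY)        # hypothetical weight file
    size = os.fstat(fd).st_size
    mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)

    # (a) lean on kernel readahead with a page-granularity hint
    if hasattr(mmap, "MADV_SEQUENTIAL"):
        mm.madvise(mmap.MADV_SEQUENTIAL)
    tensor = mm[0 : 64 << 20]                       # faults 4K/16K pages in on demand

    # (b) explicit, tensor-sized prefetch of the range you know comes next
    next_off, next_len = 64 << 20, 64 << 20
    if hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED, next_off, next_len)

The interesting measurement is (a) versus issuing (b) one layer ahead of the compute loop.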
The framing assumes NVMe offloading is a capability gap to close, but it's really a workaround for running models too large for your hardware. Buying more RAM costs less than the engineering overhead here.
This is interesting work, thank you for sharing. What hardware would you buy today for experimenting? Seems like the new generation of MacBook Pros is pretty powerful?
Yes, definitely. I use an M1 Max with 32 GB of RAM daily and it's about on par, from a performance standpoint, with the new base M5 Pro with 24 GB. You can check the benchmarks in the repo if you're interested in seeing specific performance metrics, but investing in Apple hardware with as much memory as possible will generally get you furthest in this game.
This doesn't surprise me all that much; mmap support gets little attention in general and interacts poorly with GPU-side inference. (And that's with it being the default; you don't even really need to specify it as a CLI option.) OP has raised a discussion with the llama.cpp folks (https://github.com/ggml-org/llama.cpp/discussions/20852), but there has been little interest so far.
Maybe the 1T parameter Kimi K2.5 too if you can get that to work, see https://twitter.com/seikixtc/status/2036246162936910322 and https://twitter.com/danpacary/status/2036480556045836603