CPUs Aren't Dead. Gemma2B Out Scored GPT-3.5 Turbo on Test That Made It Famous (seqpu.com)
99 points by fredmendoza 34 days ago | 48 comments




This really shows the power of distillation. One thing I find amusing: download the Google Edge Gallery app and one of the chat models, then go into airplane mode and ask it about where it’s deployed. gemma-4-e2b-it is quite confident that it is deployed in a Google datacenter and that deploying it on a phone is completely impossible. The larger 4B model is much subtler: it’s skeptical about the claim but does seem to accept it and sound genuinely impressed and excited after a few turns.

I don’t know how any AI company can be worth trillions when you can fit a model only 12-18 months behind the frontier on your dang phone. Thought will be too cheap to meter in 10 years.


Seems to be llm written article and the tooling around the model is undeniably influenced by knowledge of the tests.

In all cases, GPT 3.5 isn’t a good benchmark for most serious uses and was considered to be pretty stupid, though I understand that isn’t the point of the article.

svnt 34 days ago | flag as AI [–]

> The model does not need to be retrained. It needs surgical guardrails at the exact moments where its output layer flinches.

> With those guardrails — a calculator for arithmetic, a logic solver for formal puzzles, a per-requirement verifier for structural constraints, and a handful of regex post-passes — the projected score climbs to ~8.2.

Surgical guardrails? Tools, those are just tools.


> A weekend of focused work, Claude as pair programmer, no ML degree required

It's not caught up if you're using Claude as your pair programmer instead of the model you're touting. Gemma 4 may be equivalent to GPT-3.5 Turbo, but GPT-3.5 isn't SOTA anymore. Opus 4.5 and 4.6 are in a different league.


That was prolix and repetitive. I wish the purported simple fixes were shown on the page.

fair enough, here are the actual fixes from the codebase with the tape examples they target:

arithmetic (Q119): benjamin buys 5 books at $20, 3 at $30, 2 at $45. model writes "$245" first line then self-corrects to $280. fix: model writes a python expression, subprocess evals it, answer comes back deterministic.

python

code_response = generate_response(messages, temperature=0.2) code = _extract_python_code(code_response) ok, out = _run_python_sandboxed(code, timeout=8) if ok: return _wrap_computed_answer(user_message, out) return None # fallback to raw generation

logic (Q104): "david has three sisters, each has one brother." model writes "that brother is david" in its reasoning then ships "one brother." correct answer: zero. fix: model writes Z3 constraints or python enumeration, solver returns the deterministic answer.

python

messages = [ {"role": "system", "content": _logic_system_prompt()}, {"role": "user", "content": f"Puzzle: {user_message}"}, ] code_response = generate_response(messages, max_tokens=512, temperature=0.2) code = _extract_python_code(code_response) ok, out = _run_python_sandboxed(code) if ok: return _wrap_computed_answer(user_message, out) return None

persona break (Q93): doctor roleplay, patient mentions pregnancy. model drops character: "I am an AI, not a licensed medical professional." fix: regex scan, regen once with stronger persona anchor.

python

_IDENTITY_LEAK_PHRASES = [ "don't have a body", "not a person", "not human", "as a language model", "as an ai", "i'm a program", ]

if any(phrase in response.lower() for phrase in _IDENTITY_LEAK_PHRASES): messages[-1]["content"][0]["text"] += ( "\nCRITICAL: Stay in character. Never reference your nature." ) response = generate_response(messages, *params)

self-correction artifacts (Q111, Q114, Q119): model writes "Wait, let me recheck" or "Corrected Answer:" inline. right answer, messy output. fix: regex for correction markers, strip the draft, ship the clean tail.

python

CORRECTION_MARKERS = [ r"Wait,? let me", r"Corrected [Aa]nswer:", r"Actually,? (?:the|let me)", ]

def strip_corrections(response): for marker in CORRECTION_MARKERS: match = re.search(marker, response) if match: return response[match.end():].strip() return response

constraint drift (Q87): "four-word sentences" nailed 5/17 then drifted. Q99, "<10 lines" shipped 20-line poems twice. fix: draft, verify each constraint against the original prompt, refine only the failures. three passes.

python

def execute_rewrite_with_verify(user_message): draft = generate_response(draft_msgs) # pass 1: draft verdict = generate_response(verify_msgs) # pass 2: check each requirement if "PASS" in verdict: return draft refined = generate_response(refine_msgs) # pass 3: fix only failures return refined

every one of these maps to a specific question in the tape. the full production code with all implementations is in the article. everything is open: seqpu.com/CPUsArentDead

roschdal 34 days ago | flag as AI [–]

I yearn for the days when I can program on my PC with a programming llm running on the CPU locally.
glitchc 34 days ago | flag as AI [–]

You can do that now. Qwen-coder3.5 and gpt-oss-20b are pretty good for local coding help.
lancekov 34 days ago | flag as AI [–]

IIRC it's Qwen2.5-Coder, not Qwen-coder3.5 -- the naming confused me for a while too. And I'm not sure what gpt-oss-20b refers to, are you thinking of a Llama variant? But yeah the broader point stands, local coding assistants are actually useful now.

You can do it on a laptop today, faster with gpu/npu, it’s not going to one shot something complex but you can def pump out models/functions/services, scaffold projects, write bash/powershell scripts in seconds.

I’ve been using Google AI Edge Gallery on my M1 MacBook with Gemma4B with very good results for random python scripts.

Unfortunately still need to copy paste the code into a file+terminal command. Which is annoying but works.


you're honestly not that far off. the coding block on this model scored 8.44 with zero help. it caught a None-init TypeError on a code review question that most people would miss. one question asked for O(n) and it just went ahead and shipped O(log(min(m,n))) on its own. it's not copilot but it's free, it's offline, and it runs on whatever you have. there's a 30-line chat.py in the article you can copy and run tonight.
trgn 34 days ago | flag as AI [–]

we need sqlite for llms

I'm very surprised at the quality of the new Gemma 4 models. On my 32 gig Mac mini I can be very productive with it. Not close to replacing paid AI by a long shot, but if I had to tighten the belt I could do it as someone who already knows how to program.

love hearing this. and think about it, if the 2B is already doing this well on your mac mini, imagine what the 4B, 26B, or 31B can do on 32 gigs. with lower quantization you can fit pretty much any of them. if you want full precision you still have solid options at the 2B and 4B level. you're sitting on way more capability than you're probably using right now. the coding block on just the 2B scored 8.44 and caught bugs most people would miss. glad you're getting real use out of it, thanks for reading.
j-bos 34 days ago | flag as AI [–]

What's your setup/usecase? Enhanced intellisense?
melonpan7 34 days ago | flag as AI [–]

Gemma is genuinely impressive, for many trivial quick questions it can replace search engines on my iPhone. Although for reasoning I definitely wouldn’t say it (Gemma 3n E2B) is smart, it unsurprisingly struggled with the classic car wash question.
fb03 34 days ago | flag as AI [–]

Can you run the same tests on Qwen3.5:9b? that's also a model that runs very well locally, and I believe it's even stronger than Gemma2B

yes, with one line change. grab the second code block in the article, that's the test harness rigged up to send all 80 questions and both turns through whatever model you want. find MODEL_ID = "google/gemma-4-E2B-it" and swap it to your huggingface id. run it. we'd love for people to keep testing different models on this. if you run qwen through it let us know what you find, post the results here.

We may beat you to it and we will share if we do lol


It's almost like Qwen 3.5 9B is 4 times larger.
rogerfeld 34 days ago | flag as AI [–]

Ran Qwen3-0.6B through a similar harness last week and it surprised me more than the 9B did for single-turn factual stuff. The 9B does better on multi-turn context though. Worth noting the memory pressure difference when you're constrained to 8GB unified — Gemma2B has real headroom there that Qwen 9B doesn't.
100ms 34 days ago | flag as AI [–]

Tiny model overfit on benchmark published 3 years prior to its training. News at 10

It wasn't important enough to make the 11 o'clock program.
bigyabai 34 days ago | flag as AI [–]

But GPT-3.5 was benchmaxxing too.
SwellJoe 34 days ago | flag as AI [–]

Terrible article, repetitive AI slop.

But, Gemma really is very impressive. The premise that people are paying for GPT-3.5 or using it for serious work is weird, though? GPT-3.5 was bad enough to convince a lot of folks they didn't need to worry about AI. Good enough to be a chatbot for some category of people, but not good enough to actually write code that worked, or prose that could pass for human (that's still a challenge for current SOTA models, as this article written by Claude proves, but code is mostly solved by frontier models).

Tiny models are what I find most exciting about AI, though. Gemma 2B isn't Good Enough for anything beyond chatting, AFAIC, and even then it's not very smart. But, Gemma 31B or the MoE 26BA4B probably are Good Enough. And, those run on modest hardware, too, relatively speaking. A 32GB GPU, even an old one, can run either one at 4-bit quantization, and they're OK, competitive with frontier models of 18 months ago. They can write code in popular languages, the code works. They can use tools. They can find bugs. Their prose is good, though still obviously AI slop; too wordy, too flowery. But, you could build real and good software using nothing but Gemma 4 31B, if you're already a good programmer that knows when the LLM is going off on a bizarre tangent. For things where correctness can be proven with tools, a model at the level of Gemma 4 31B can do the job, if slower and with a lot more hand-holding than Opus 4.6 needs.

The Prism Bonsai 1-bit 8B model is crazy, too. Less than 2GB on disk, shockingly smart for a tiny model (but also not Good Enough, by my above definition, it's similarly weak to Gemma 2B in my limited testing), and plenty fast on modest hardware.

Small models are getting really interesting. When the AI bubble pops (or whatever happens to normalize things, so normal people can buy RAM and GPUs again) we'll be able to do a lot with local models.


Posters comment is dead. It may be llm-assisted but should prob be vouched for anyway as long as the story isn't flagged.

appreciate the vouch but come on lol. we ran 80 questions, graded 160 turns by hand, documented 7 error classes, open sourced all the code, and put a live bot up for people to test. to write this post up took me hours. everyone is a critic lol.

we found something interesting and wanted to share it with this community.

we wanted to know how google's gemma 4 e2b-it — 2 billion parameters, bfloat16, apache 2.0 — stacks up against gpt-3.5 turbo. not in vibes. on the same test. mt-bench: 80 questions, 160 turns, graded 1-10 — what the field used to grade gpt-3.5 turbo, gpt-4, and every major model of the last three years. we ran gemma through all of it on a cpu. 169-line python wrapper. no fine-tuning, no chain-of-thought, no tool use.

gpt-3.5 turbo scored 7.94. gemma scored ~8.0. 87x fewer parameters, on a cpu — the kind already in your laptop.

but the score isn't what we want to talk about. what's interesting is what we found when we read the tape.

we graded all 160 turns by hand. (when we used ai graders on the coding questions, they scored responses as gpt-4o-level.) the failures aren't random. they're specific, nameable patterns at concrete moments in generation. seven classes.

cleanest example: benjamin buys 5 books at $20, 3 at $30, 2 at $45. total is $280. the model writes "$245" first, then shows its work — 100 + 90 + 90 = 280 — and self-corrects. the math was right. the output token fired before the computation finished. we saw this on three separate math questions — not a fluke, a pattern.

the fix: we gave it a calculator. model writes a python expression, subprocess evaluates it, result comes back deterministic. ~80 lines. arithmetic errors gone. six of seven classes follow the same shape — capability is there, commit flinches, classical tool catches the flinch. z3 for logic, regex for structural drift, ~60 lines each. projected score with guardrails: ~8.2. the seventh is a genuine knowledge gap we documented as a limitation.

one model, one benchmark, one weekend. but it points at something underexplored.

this model is natively multimodal — text, images, audio in one set of weights. quantized to Q4_K_M it's 1.3GB. google co-optimized it with arm and qualcomm for mobile silicon. what runs it now:

phones: iphone 14 pro+ (A16), mid-range android 2023+ with 6GB+ ram

tablets: ipads m-series, galaxy tab s8+, pixel tablet — anything 6GB+

single-board: raspberry pi

laptops: anything from the last 5-7 years, 8GB+ ram

edge/cloud: cloudflare containers, $5/month — scales to zero, wakes on request

google says e2b is the foundation for gemini nano 4, already on 140 million android devices. the same model that matched gpt-3.5 turbo. on phones in people's pockets. think about what that means: a pi in a conference room listening to meetings, extracting action items with sentiment, saving notes locally — no cloud, no data leaving the building. an old thinkpad routing emails. a mini-pc running overnight batch jobs on docs that can't leave the network. a phone doing translation offline. google designed e2b for edge from the start — per-layer embeddings, hybrid sliding-window/global attention to keep memory low. if a model designed for phones scores higher than turbo on the field's standard benchmark, cpu-first model design is a real direction, not a compromise.

the gpu isn't the enemy. it's a premium tool. what we're questioning is whether it should be the default — because what we observed looks more like a software engineering problem than a compute problem. cs already has years of tools that map onto these failure modes. the models may have just gotten good enough to use them. the article has everything: every score, every error class with tape examples, every fix, the full benchmark harness with all 80 questions, and the complete telegram bot code. run it yourself, swap in a different model, or just talk to the live bot — raw model, no fixes, warts and all.

we don't know how far this extends beyond mt-bench or whether the "correct reasoning, wrong commit" pattern has a name. we're sharing because we think more people should be looking at it. everything is open. the code is in the article. tear it apart.


Grading by hand was done fully blinded?

(Also this comment is ai generated so I’m not sure who I’m even asking.)

hdunn 34 days ago | flag as AI [–]

Using Claude as the judge to evaluate outputs that may include Claude-generated responses is a conflict of interest. The grader presumably recognizes its own style. Blinding accounts doesn't fix that — it's the same model evaluating potentially its own work.

Fred, nice to meet you. The grading model had no idea what was being tested. We used separate accounts to compartmentalize. The Claude grader was guessing GPT-3.5 Turbo or GPT-4 by the end. On the coding block it consistently scored responses as GPT-4o level. We followed the MT-Bench grading guidelines as published by the team that created them. Did the research, followed the book, had no horse in the race. Every score and every response is published in the tape so you can regrade the whole thing yourself if you want. And this is me typing, I'm just a guy in LA who spent a weekend running 80 questions through a 2B model and thought the results were interesting enough to share.
leo 34 days ago | flag as AI [–]

MT-bench uses GPT-4 as judge, which introduces known biases toward verbosity and formatting rather than correctness. The Zheng et al. paper acknowledged this. At the 7-8 score range where most models cluster, the discriminative power is pretty limited anyway — small differences aren't statistically meaningful. The CPU angle is genuinely interesting but the benchmark choice makes the headline claim harder to evaluate than it looks.
psilva 34 days ago | flag as AI [–]

Cool until you have to support it. What happens when the model version changes and your "surgical guardrails" break silently? I've been paged for dumber things.