Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

jwr · 92 days ago

That is very, very interesting. I've been hoping to have an assistant in the workshop (hands-free!) that I could talk to and have it help me with simple tasks: timers, calculating, digging up notes, etc. — basically, what the phone assistants were supposed to be, but aren't.

"You will have to unlock your iphone first" is kind of a deal-breaker when you are in the middle of mixing polyurethane resin and have gloves and a mask on.

More and more I find that we have the technology, but the supposedly "tech" companies are the gatekeepers, preventing us from using the technological advances and holding us back years behind the state of the art.

I'll be trying this out on my Macbook, looks very promising!

gtowey · 92 days ago

The computing power we all have in our pockets is staggering. It could be tool that truly makes our lives easier, but instead it's mostly a device that is frustrating to use. Companies have decided to make it simply another conduit for advertising. It's a tool for them to sell us more stuff. Basic usability be damned.

jamilton · 91 days ago

Siri does have a setting that'll activate it if you say "hey siri" while the phone is locked. Obvious privacy and battery usage concerns though, and it's still Siri, so it's a little clunky.

noel60 · 91 days ago

We tried the always-on Siri route in our shop. The battery hit is real, but the bigger annoyance was false triggers — Siri waking up constantly from background noise. Ended up just using a cheap Bluetooth button to activate it manually.

mentalgear · 92 days ago

You might be interested in the open-source https://www.home-assistant.io/voice-pe/ .

simonfeld · 92 days ago

We ran Home Assistant Voice for a bit in our kitchen. The wake word detection was flaky enough that we gave up — maybe 1 in 4 times it just didn't catch it. Could be our environment though. Worth trying before building your own.

huijzer · 92 days ago

> More and more I find that we have the technology, but the supposedly "tech" companies are the gatekeepers

Yes same with RSS readers being dropped by large companies. Worked too good I guess!

jon78 · 92 days ago

Alexa was supposed to solve this exact problem. 2014. Echo in the workshop, hands-free, always listening. The hardware was fine. The service was the problem — and still is. Local inference removes that dependency entirely.

logicallee · 92 days ago

It might interest people to know you can also easily fine-tune the text portion of this specific model (E2B) to behave however you want! I fine-tuned it to talk like a pirate but you can get it to do anything you have (or can generate) training data for. (This wouldn't make it to the text to speech portion though.) So you can easily train it to act a certain way or give certain types of responses.

Video: https://www.youtube.com/live/WuCxWJhrkIM

Generated writeup: https://taonexus.com/publicfiles/apr2026/pirate-gemma-journa...

3abiton · 89 days ago

We're getting closer now to black mirror level of technology.

dvt · 92 days ago

Solid work and great showcase, I've done a bunch of stuff with Kokoro and the latency is incredible. So crazy how badly Apple dropped the ball... feels like your demo should be a Siri demo (I mean that in the most complimentary way possible).

karimf · 92 days ago

Thank you. This reminds me of a paragraph from the LatentSpace newsletter [0]

> The excellent on device capabilities makes one wonder if these are the basis for the models that will be deployed in New Siri under the deal with Apple….

https://www.latent.space/p/ainews-gemma-4-the-best-small-mul...

magzter · 92 days ago

This is so cool, I'm always speaking to people about how the advancement in the SOTA hosted AI's is also happening in the local model space, i.e. the SOTA hosted AI models 6-12 months ago are what we're seeing now being able to run locally on average hardware - this is such an amazing way to actually demo it.

est · 92 days ago

I am making something similar. Also been using Kokoro for TTS. Very cool project!

Gemma 4 is kinda too heavyweight even with E2B. I am sticking with qwen 0.8B at the moment.

zerop · 92 days ago

I have been looking forward to build something like this using open models. A voice assisstant I can talk while I am driving, as I do have long commute. I do use chatGPT voice mode and it works great for querying any information or discussions. But I want to do tasks like browsing web, act like a social media manager for my business etc.

myultidevhq · 91 days ago

This is really impressive for running locally on an M3 Pro. The latency looks surprisingly good for real-time audio and video input.

Curious about one thing though, how does it handle switching between languages? I work with both Greek and English daily and local models usually struggle with that.

Great work, bookmarking this.

karimf · 91 days ago

During my limited testing, it works better than I expected at handling multiple languages in a single session. Perhaps I just had a low expectation since I've mostly worked with English-only STT models.

crsAbtEvrthng · 91 days ago

If I run this without internet connection it says "loading..." at the bottom of the localhost site and won't work.

If I run this with internet connected it works flawlessly. Even if I disconnect my internet afterwards it still goes on working fine.

Why there has to be an internet connection established at the time I open the localhost site when all of this should be working purely on device?

Despite of this, I am really impressed that this actually works so fast with video input on my M4 Pro 48 GB.

karimf · 91 days ago

Huh that's weird. I just tried it and it works on my machine. Could you perhaps create a GitHub issue and share the reproduction steps and any relevant logs?

matula · 91 days ago

the index.html is loading remote js files: https://github.com/fikrikarim/parlor/blob/main/src/index.htm...

I saved them locally and changed the reference, and it worked perfectly.

noodlebreak · 91 days ago

I have to try it out on my idle laptops. I've been meaning to run some models on them for low cost tasks that need AI - like sorting and filtering photos from 100s of thousands that I have amassed over the years. And applying general size reduction compression to the filtered ones.

Btw if anyone has already created such a pipeline/workflow using such models, please lmk!

spwa4 · 91 days ago

I've been trying to do this, but I can't get voice recognition to work fast enough (meaning live) with Gemma E2B, on either an M1 max (64GB), a 5060 Ti (16Gb) or a SnapDragon 8 Gen2.

Any pointers?

karimf · 91 days ago

What's your average response time with M1 max and what's the target?

jff36 · 91 days ago

Voice recognition latency is rarely the bottleneck. If you're running Gemma E2B with a streaming VAD like silero, the first token should appear under 300ms on an M1 max. Are you batching audio chunks or processing on every frame?

rubicon33 · 91 days ago

Is there anything unique here happening for the video aspect or is it just taking snapshots over and over?

I’ve been looking for a good video summarizing / understanding model!

karimf · 91 days ago

Nothing unique, it's just taking a snapshot when it's processing the input. Even processing a single image will increase the TTFT by ~0.5s on my machine, so for now, it seems to be impossible for feeding a live video and expecting a real-time response.

In regards to the video capability, I haven't tested it myself, but here's a benchmark/comparison from Google [0]

[0] https://huggingface.co/blog/gemma4#video-understanding

divan · 92 days ago

Can someone quickly vibe code MacOS native app for that so it doesn't require running terminal commands and searching for that browser tab? (: (also for iOS, pls)

duartefdias · 92 days ago

Would you pay 2$ for that MacOS native desktop app?

inzlab · 91 days ago

Real time ai sounds like the future

an0n-elem · 91 days ago

Cool work buddy:)

jareklupinski · 91 days ago

just make it say "Uh...", "umm...", or "hmmm..." once or twice halfway between processing and finish :D

k-almuraee · 92 days ago

Amazing, love your work ,

noel686 · 91 days ago

Cool until the model decides to OOM at 2am and your audio pipeline hangs silently. No watchdog, no restart policy, no alerting. Good luck with that in prod.