Show HN: Agent-desktop – Native desktop automation CLI for AI agents

jstanley · 18 days ago

lahfir, I vouched for your (currently still dead) comment because it was interesting to me.

I expect the reason it is dead is that it seems LLM-generated (you "quietly" launched it on GitHub? Who says that?).

Also, your comment claims that the tool is cross-platform and implies that it works on Mac, Windows, and Linux, but the graphic in the GitHub README says it only works on Mac.

nerdsniper · 18 days ago

It looks hybrid human/LLM at best, but it's definitely possible that it's mostly human, from someone who is earnestly learning how to use "pitch" language. I got the feeling that some parts, like the bullet points, maybe originated from AI-generated documentation/READMEs.

My intuition tells me that it could have been AI-generated, but if that's the case then it was heavily edited by a human, and I think anyone who edited it that heavily would have changed other things as well. That's why I suspect it's human writing in a pseudo-artificial, pitch-"coded" style, with some (mostly lightly edited) copy/paste of AI bullet points.

Then again, I can't find snippets of this language in the repo, so maybe I'm losing my discernment as LLMs advance (as well as the humans who are learning how to use them).

lahfir · 15 days ago

Hey, thanks for the comment. Yes, it's hybrid. AI wrote up what I gave it as input. If it's much better at articulating a message than I am, why not use it, right?

jbreckmckye · 18 days ago

I think this guy is using AI for pretty much everything - he says as much in his GH profile. In fact his photo bears a Gemini watermark, meaning that is AI too.

eli · 18 days ago

Your instinct about heavy editing is probably right - that tends to be the hardest case for detection, since edits often target meaning but leave the syntactic fingerprint.

preommr · 18 days ago

Wouldn't the opposite be true? That an llm would use well-known terms for general purpose writing. I think it's much more likely that a human would remember 'silent' launch, or 'stealth' launch, and use 'quietly' as a substitute.

I feel very strongly that comment wasn't AI generated.

Also, there's a bunch of normal comments that seem to be wrongfully flagged.

jstanley · 17 days ago

> Wouldn't the opposite be true? That an llm would use well-known terms for general purpose writing.

You'd think, and yet LLMs do in fact have a particular style, and lots of it is common across all LLMs.

vasco · 18 days ago

3 fake comments in the thread also

handfuloflight · 18 days ago

Why is Claude always pointing out or assuming what is done quietly?

slow_wire · 18 days ago

So Mac is GA but the README graphic showing only Mac was misleading?

lahfir · 15 days ago

Hi. Mac is GA. Windows and Linux are on the roadmap.

esperent · 18 days ago

Looks interesting but like every single one of these computer use apps I've seen, it's macOS only.

Does anyone know of a Linux one?

Zetaphor · 18 days ago

I don't think the accessibility story on Linux is comprehensive enough to make this possible unfortunately. Especially with Wayland. One advantage Mac apps have is they're all targeting the same underlying OS primitives, which is the layer their accessibility platform lives at.

tuukkah · 18 days ago

Quote from a sibling comment:

  - macOS: Accessibility API
  - Windows: UI Automation
  - Linux: AT-SPI
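
A rough sketch of how a tool could pick between these at runtime. The backend names come from the list above; the `BACKENDS` table and `accessibility_backend` function are hypothetical names for illustration, not agent-desktop's actual code:

```python
import sys

# Backend names from the quoted list; the lookup table itself is an assumption.
BACKENDS = {
    "darwin": "Accessibility API",  # macOS
    "win32": "UI Automation",       # Windows
    "linux": "AT-SPI",              # Linux
}


def accessibility_backend(platform: str = sys.platform) -> str:
    """Map a sys.platform-style string to the name of its accessibility API."""
    for prefix, name in BACKENDS.items():
        if platform.startswith(prefix):
            return name
    raise ValueError(f"no known accessibility backend for {platform!r}")
```

Anything outside those three platforms raises instead of guessing, which seems like the safer default for an automation tool.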

9879875665876 · 18 days ago

There is AT-SPI2:

https://invent.kde.org/sdk/selenium-webdriver-at-spi

gvkhna · 17 days ago

I built this, but it's not open source because it’s designed more for the bot evasion/stealth market. It’s designed to let AI control the real browser without CDP, so no Chromium forks and nothing to “detect.”

lahfir · 15 days ago

Mac is Generally Available. Windows and Linux are on the roadmap!

TheFragenTaken · 18 days ago

I've long wondered why the tools we have operate on screenshots and not the accessibility tree. To me the latter would have seemed like the obvious choice from the beginning (structured data), and yet here we are with pixels. Happy to see progress being made here.

MattRogish · 17 days ago

The major limitation is that macOS apps do not have to use the API and so there will always need to be a fallback to something like screen scraping for controls that don’t use it.

Zoom Desktop app is a prime example of this. Many of the windows (join a meeting, settings etc) are normal macOS ones, and those use AX buttons, but many are poorly / weirdly labeled (if at all).

But once the Zoom meeting appears, that’s all (?) custom, and so the best you can do is whatever Zoom decided to offer. The dreaded “this meeting is being recorded” pop-up is a custom control and so doesn’t have AX at all; I have automation that basically looks for an appearing window and, if it has “OK”, just blindly clicks it and hopes for the best.
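
Roughly the shape of that logic, as a minimal sketch. It models the accessibility tree as plain nested dicts so it runs anywhere; the `role`/`title`/`children` keys and the function names are made up for illustration and are not the real macOS AX API:

```python
from typing import Iterator, Optional


def walk(node: dict) -> Iterator[dict]:
    """Yield a node and all of its descendants, depth-first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def find_button(root: dict, title: str) -> Optional[dict]:
    """Return the first button with the given title, or None if absent."""
    for node in walk(root):
        if node.get("role") == "button" and node.get("title") == title:
            return node
    return None


def dismiss_popup(window: dict) -> str:
    """Prefer the labeled button; otherwise signal a pixel-based fallback."""
    button = find_button(window, "OK")
    if button is not None:
        return f"press:{button['title']}"
    # Custom-drawn window: nothing useful in the tree.
    return "fallback:screenshot"


labeled = {"role": "window", "children": [
    {"role": "group", "children": [{"role": "button", "title": "OK"}]},
]}
custom = {"role": "window", "children": []}
```

Here `dismiss_popup(labeled)` takes the structured path, while `dismiss_popup(custom)` falls through to the screenshot fallback, which is where the custom Zoom controls end up.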

tidbeck · 18 days ago

While the accessibility tree is great in many respects, it has its own limitations, for example when it comes to stacked views or lazy loading outside the viewport.

nlitened · 18 days ago

I think screenshots also don't help with stacked views and lazy loading outside the viewport

hithere12 · 13 days ago

Really interesting idea! The demo made me realize I just assumed screenshot-based control was the only option.

_crowecawcaw · 17 days ago

I actually built nearly the same tool under the same name: https://agent-desktop.dev And I've seen a couple other similar projects since then too! Seems like a lot of us are thinking in the same direction.

One wrinkle I found is that there wasn't a cross-platform library for accessibility APIs, and each platform is a bit different. I made an a11y library that supports Mac, Windows, and X11 and Wayland on Linux with a consistent interface: https://xa11y.dev

someone654 · 18 days ago

Looks very interesting. I especially like that the language environment is abstracted away behind a CLI, so that one is not stuck with, for example, Python to write the UI logic (or having to create your own CLI wrapper around PyAutoGUI).

How can one help with implementing Linux and Windows support?

lahfir · 15 days ago

Hi, Mac is generally available. Windows and Linux are on the roadmap!

DeathArrow · 17 days ago

I presume this only works if you use native OS interfaces like MFC in Windows, Cocoa in macOS or GTK in Linux.

It would be nice if it also worked with GUI libraries that talk directly to hardware like Capy for Zig, egui for Rust or Dear ImGui for C++.

xnx · 18 days ago

The best desktop automation system would take HDMI input and output USB keystrokes and mouse movements so that it can be plugged into any computer transparently, including work computers.

ActorNightly · 18 days ago

You don't need HDMI out, just the ability to take screenshots, which is easy to script.

Arguably though, browser automation gets you 95% of the way there for most things.

xnx · 18 days ago

Many systems won't allow the end user to install any software (e.g. work-issued laptops), but you can plug in HDMI and USB.

lukewarm707 · 17 days ago

if you can attach a local llm...hdmi is airgapped (sort of)...

the operating computer requires no processing power or install....

it plugs into any interface............

i plug it into a scada...............

$$$$$$

dmd · 17 days ago

What’s the purpose of all the dots?

zuzululu · 18 days ago

This is neat! Tried the finder example and was impressed how quick it was.

I would love it if it could support the iOS simulator, or iPhone? I am using Maestro but it is so damn slow and seems to be token hungry.

handfuloflight · 18 days ago

https://github.com/callstackincubator/agent-device

has10 · 18 days ago

Is the tree access actually a hard requirement, or would screenshot-based coordinate clicking work as a fallback for Flutter? Genuine question — not clear if agent-device even supports that mode.

zuzululu · 17 days ago

Seems like this is for React Native; Flutter won't expose the trees, bummer.

Guess I'm stuck with Maestro

dorianzheng · 13 days ago

Is it possible to run it inside a local micro-VM, such as boxlite?

rado · 18 days ago

Interesting, would be nice to see a demo video apart from that unclear GIF

lahfir · 15 days ago

Hi, appreciate the comment. Here's the demo: https://x.com/mdlahfir/status/2050402398498414908?s=20

z3ratul163071 · 18 days ago

i knew it... macos

dotancohen · 18 days ago

OP claims cross platform.

  > It's a cross-platform CLI for structured desktop automation through the accessibility tree.

lahfir · 15 days ago

Mac for now. Windows and Linux are on the roadmap.

DeathArrow · 18 days ago

This is big if it works. Nice job!

eweber · 17 days ago

What happens when the screen resolution changes mid-run, or the app redraws during a click? Screenshot-based control has a dozen race conditions nobody talks about until 3am.