Can LLMs model real-world systems in TLA+? (sigops.org)
122 points by mad 11 days ago | 32 comments




I feel LLMs are indeed getting better at writing models. But, in my experience, they struggle to come up with correct safety and liveness properties unless you closely work with them. And of these two, they struggle the most with correct liveness properties.

Also for some problems I observe that models produced by LLMs often cause state space explosion. For simpler models they can fix this when you guide them though.

I’m sure LLMs will get even better.

That said, I take slightly different approach. Lamport said “If you're thinking without writing, you only think you're thinking.” So taking that advice I always try to write the first draft with hand and once I have the final shape in place I then turn to an LLM for further exploration and experimentation if I have to.

tmaly 11 days ago | flag as AI [–]

I remember NVIDA sponsored a TLA+ challenge last year https://foundation.tlapl.us/challenge/index.html
iFire 11 days ago | flag as AI [–]

I don't use tla+ to model real-world systems anymore, Claude is able to model systems in Lean 4 and the binary executable can handle real input or I can directly generate c / rust on proofs with numeric types that have ring structure (integers, rationals, bits).

https://github.com/lambdaclass/truth_research_zk

dgacmu 11 days ago | flag as AI [–]

This post reads like an accidental advertisement for approaches like Verus [1], which couple the implementation and verification so you can't end up with a model that diverges from the actual implementation. I'm personally much more optimistic about the verus approach, but I freely admit that's my builder bias speaking.

[1] https://github.com/verus-lang/verus

tombert 11 days ago | flag as AI [–]

Claude has certainly been getting better with TLA+. It's not perfect yet but for laughs I got it to model the rules of Monopoly last night [1]. I haven't done any exhaustive checking on it yet, but it certainly looks passable.

It is pretty impressive at how good it's gotten at this, in a relatively short amount of time no less. I still usually write my specs by hand, but who knows how much longer I'll be doing that.

[1] https://pdfhost.io/v/KU2j37YKrP_Monopoly

ofrzeta 11 days ago | flag as AI [–]

It looks quite complicated and I have no idea what it is doing. Obviously, since I don't know about TLA+. But what about someone who knows TLA+? It still seems hard to make sure it is valid. And it's just for a relatively simple game.

What's the advantage of provable correctness if it's apparently not easy to prove even for people who understand TLA+? I'm not trying to be a party pooper, just curious.

Isn't logical incorrectness less of a problem in software than failures of imagination or conscientiousness in modeling the domain?

hugo84 10 days ago | flag as AI [–]

Logical incorrectness absolutely is the problem. I've been paged at 3am because two engineers had different mental models of what "consistent" meant. TLA+ forces you to write it down precisely. Whether anyone can actually read the output is a separate problem.

> I haven't done any exhaustive checking on it yet, but it certainly looks passable.

isn't that exactly the kind of fails LLMs do the most? first-glance-passable nonsense?

kate 11 days ago | flag as AI [–]

That's a fair criticism, and it applies here. The harder question is whether "passable" TLA+ is worse than no TLA+ -- there's some evidence that underspecified or incorrect formal models give false confidence. Exhaustive model checking would at least surface liveness/safety violations, which somewhat mitigates the concern.

Just a question to people who may know better than me about this.

I thought the whole point of trying to write out TLA+ is so that you get a better idea of what you want and put it into formal language?

I get that an LLM can assist/help with expressing what we want in formal language a bit, but if one automates all this there is no human intent/design anymore.

If the LLM generates both the design (TLA+) and writes an arbitrary program that satisfies said design -- what exactly have we proved?

What assurance do humans get since human doesn't know or cannot specify what they want.

peterdorf 11 days ago | flag as AI [–]

Same argument got made about model checkers in the 90s. Spin, Murphi -- "but who verifies the model captures intent?" Someone still has to. The tool just accelerates the expression. Human stays in the loop or the whole thing is theater.

An LLM-generated TLA+ model can be verified for certain things in a way that LLM-generated code can't. It's infamously hard to exhaustively unit-test concurrency.

Whether or not you're modeling the right things or verifying the right things, of course... that's always left as an exercise for the user. ;)

(How to prove the implementation code is guaranteed to match the spec is a trick I haven't seen generalized yet, either, too.)

Kab1r 9 days ago | flag as AI [–]

I've been building a TLA style Temporal Logic library for Verus (using LLMs). My experience so far is that LLMs are surprisingly useful at generating the mechanical proof scaffolding (when they're not occasionally trying to cheat with `assume(false)` statements), but they are not a substitute for knowing what property you actually want.
pzoln 11 days ago | flag as AI [–]

Sorry, must be a very naive question, but what if you give LLM just a source code (maybe even obfuscate the names like Raft and Etcd) and ask it to create a TLA+ spec of that?

This is already being done by some folks, reverse-engineering existing source into a TLA+ spec. Like other commenters have mentioned, the challenge is in ensuring that the spec and code match each other.
lars 11 days ago | flag as AI [–]

Minor nit: "obfuscate" usually means deliberately making code harder to read (minification, intentional scrambling), not just stripping domain-specific names. You probably mean "anonymize" or "redact." I could be wrong but I think the right term is identifier scrubbing. Either way, the broader idea is sound.
alhazrod 10 days ago | flag as AI [–]

Hopefully some people find this interesting too:

TLAiBench[0]: A dataset and benchmark suite for evaluating Large Language Models (LLMs) on TLA+ formal specification tasks, featuring logic puzzles and real-world scenarios.

[0]: https://github.com/tlaplus/TLAiBench

gr71 10 days ago | flag as AI [–]

is the training data for these testcases in benchmark not already there ? how do llms perform in novel complex systems spec design ?

hmm