Verity Benchmark
Work in progress. A benchmark suite for AI-driven formal proof generation on real smart contracts.
Formal proofs are only useful if someone can write them. Today that requires deep expertise in both the contract logic and the proof assistant. We think AI agents can close this gap, but to get there, we need a way to measure how well they do it.
Verity Benchmark is an open database of formally specified properties drawn from real, deployed smart contracts. Each property comes with a Verity model of the contract logic, a Lean 4 theorem statement, and a harness that evaluates whether an agent's proof compiles and is correct.
The goal is simple: give agents a theorem and see if they can prove it. The better they get, the cheaper and more accessible formal verification becomes.
What's in the benchmark
The suite currently covers 5 protocols with 22 tasks across a range of property types: from conservation laws and economic invariants to state-transition correctness and tree-structure proofs.
Each task includes the contract model, a formal specification, and a proof template. The agent's job is to fill in the proof. Difficulty ranges from straightforward arithmetic lemmas to multi-step reasoning over non-linear uint256 operations.
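As a purely hypothetical illustration (not an actual task from the suite, and with invented names), a proof template might pair a Lean 4 theorem statement with a `sorry` placeholder for the agent to fill in:

```lean
-- Hypothetical conservation-style task: a transfer moves value between
-- two balances without changing their sum. `bal_from`, `bal_to`, and
-- `amount` are illustrative names, not identifiers from the benchmark.
theorem transfer_preserves_total (bal_from bal_to amount : Nat)
    (h : amount ≤ bal_from) :
    (bal_from - amount) + (bal_to + amount) = bal_from + bal_to := by
  sorry  -- the agent must replace this with a complete proof
```

Real tasks range from lemmas like this, which a single arithmetic tactic can discharge, up to multi-step reasoning where no one tactic closes the goal.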
Why this matters
The bottleneck for formal verification today isn't the proof checker. Lean's kernel is fast and reliable. The bottleneck is writing the proofs. If AI agents can learn to generate valid Lean proofs from contract specifications, then formal verification stops being a luxury reserved for the highest-value contracts and becomes something every serious protocol can afford.
To train and evaluate these agents, they need data: real contracts, real properties, real proofs. The benchmark provides exactly that, as a growing collection of cases that any agent builder can use.
How it works
Each case starts from a deployed smart contract. We model the relevant logic in Verity, write a formal specification of the property to be verified, and create a proof template with the theorem statement but no proof body.
An agent receives the template and must produce a complete Lean proof. The harness compiles it against the fixed specification: if it compiles, the proof is correct. No human review needed.
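A minimal sketch of such a harness, in Python, might look like the following. Everything here is an assumption for illustration: the function names, the convention of splicing the proof into the template's `sorry`, and the rejection list are ours, not the benchmark's actual implementation.

```python
import subprocess
import tempfile

# Tokens that let a file "compile" without actually proving anything.
ESCAPE_HATCHES = ("sorry", "admit")

def contains_escape_hatch(proof_body: str) -> bool:
    """Reject proofs that only typecheck because they punt on the goal."""
    return any(tok in proof_body for tok in ESCAPE_HATCHES)

def compile_with_lean(source: str) -> bool:
    """Write the source to a temp file and invoke the Lean compiler.

    Assumes a `lean` binary on PATH; a real harness would more likely run
    `lake env lean` inside a pinned project so dependencies resolve.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run(["lean", path], capture_output=True)
    return result.returncode == 0

def evaluate(template: str, proof_body: str, compile_fn=compile_with_lean) -> bool:
    """Splice the agent's proof into the template's `sorry` and compile.

    `compile_fn` abstracts the Lean invocation so the check is testable
    without a Lean toolchain installed.
    """
    if contains_escape_hatch(proof_body):
        return False
    source = template.replace("sorry", proof_body, 1)
    return compile_fn(source)
```

The key design point is that correctness reduces to a compiler exit code: the theorem statement is fixed, so any proof term the kernel accepts is a valid proof.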
Reference solutions exist for every task but are kept separate from the public templates, so agents are evaluated on their ability to discover proofs independently.
Current status
This is an ongoing effort. We're actively adding new cases from important contracts across the ecosystem: lending protocols, DEXs, staking infrastructure, governance systems. The more diverse the benchmark, the better agents will generalize.
We hope this work will result in a publication documenting the benchmark design, early agent evaluation results, and what we've learned about the gap between current AI capabilities and production-grade formal verification.
Contributions are welcome, both new cases and agent implementations.