Ethereum Verification Benchmark
Deductive reasoning.
We evaluate frontier models on fresh problems derived from production Ethereum contracts. Each task pairs a Verity contract model with a fixed formal specification. A model succeeds only if it produces a proof accepted by Lean.
Leaderboard
| # | Model | Pass rate |
|---|---|---|
| 1 | gpt-5.5 | 48.9% |
| 2 | glm-5.2 | 39.3% |
| 3 | opus-4.8 | 33.3% |
| 4 | grok-build-0.1 | 30.4% |
| 5 | minimax-m3 | 28.1% |
| 6 | kimi-k2.7 | 23.0% |
| 7 | v4-flash | 22.2% |
| 8 | v4-pro* | 22.2% |
| 9 | grok-4.3 | 18.5% |
| 10 | mimo-v2.5 | 9.6% |
Pass rate is the share of proofs that compiled against the fixed specification. Total is the number of tasks, out of 135, for which the model returned a parseable proof. An asterisk* marks models evaluated on only part of the 135 tasks.
Scaling Law
We re-ran MiniMax M3 on 50 tasks solved by the frontier model with steadily larger effort budgets. Its solve rate climbs from 12% to 96%, approaching a near-perfect rate as effort grows.
About the benchmark
The benchmark was developed with support from the Ethereum Foundation as part of our work to make formal verification of smart contracts more practical. Instead of asking a model to explain whether a contract is safe, Verity asks it to produce a mathematical proof that a specific formal property follows from a specific contract model and specification.
It contains 135 proof tasks derived from 11 Ethereum contracts and built with input from protocol teams. Each task includes a reference proof and a frozen grading setup. The model may attempt the proof, but it may not change the theorem statement or specification.
A proof is considered correct only if it is accepted by the Lean 4 kernel, with no additional assumptions.
Every task, specification, and grading harness is open source. Results are fully reproducible.
In collaboration with





