Ethereum Verification Benchmark
PreprintEvaluating AI deductive reasoning for Smart Contracts.
Formal verification helps find serious errors in smart contracts, but writing machine-checked proofs still takes expert work. The benchmark tests whether AI agents can produce those proofs, not just explain a property or write a unit test.
Ethereum Verification Benchmark contains 135 Lean 4 proof tasks from 25 deployed or ecosystem-relevant Ethereum cases. Each task comes with a Verity model of the contract logic, a fixed formal specification, and one editable proof file.
The v0.1 release reports comparable runs for GPT-5.5, GLM 5.2, Opus 4.8, MiniMax M3, Kimi K2.7, DeepSeek V4 Flash, DeepSeek V4 Pro, Grok 4.3, MiMo V2.5, and Leanstral-2603. The best run verifies 66 of 135 tasks. No included run solves half of the suite.
Paper
Read the preprint (PDF)
Ethereum Verification Benchmark: Evaluating AI Agents on Lean 4 Proofs for Smart Contracts.
v0.1 release
The suite is strongest today on accounting, solvency, storage effects, and linked-list ownership structures. It is thinner on reentrancy, oracle manipulation, governance, liveness, cross-contract composition, and adversarial EVM behavior.
How grading works
Agents receive public task files and a proof template. Hidden reference proofs are not present in the workspace. The verifier rejects sorry, admit, new axioms, theorem-statement edits, and unavailable imports.
A task counts as verified only when the submitted proof passes every verifier gate and rebuilds in an isolated workspace. Version manifests bind results to task and environment fingerprints, so scores can be reproduced.
How it works
Each benchmark case starts from an Ethereum contract or ecosystem-relevant contract pattern. The case fixes the modeled contract behavior and the property to prove.
The agent may edit only the designated proof file. It succeeds only by producing a complete Lean 4 proof for the fixed theorem.
This keeps scores focused on proof generation rather than changing the problem being graded.
Current status
The public leaderboard ranks comparable release runs by pass rate. It excludes runs that were not executed under the same harness. The preprint reports the v0.1 design, suite composition, release results, and a MiniMax M3 budget-scaling sidecar.
The main result is still conservative: current agents can solve many tasks, but complete proof generation for smart contract verification is not close to solved.
Contributions are welcome, both new benchmark cases and agent runs.