Ethereum Verification Benchmark

Deductive reasoning.

We evaluate frontier models on fresh problems derived from production Ethereum contracts. Each task pairs a Verity contract model with a fixed formal specification. A model succeeds only if it produces a proof accepted by Lean.

Leaderboard

#	Model	Pass rate	Passed	Failed	Total
1	gpt-5.5	48.9%	66	69	135
2	glm-5.2	39.3%	53	82	135
3	opus-4.8	33.3%	45	90	135
4	grok-build-0.1	30.4%	41	94	135
5	minimax-m3	28.1%	38	97	135
6	kimi-k2.7	23.0%	31	104	135
7	v4-flash	22.2%	30	105	135
8	v4-pro^*	22.2%	28	98	126
9	grok-4.3	18.5%	25	110	135
10	mimo-v2.5	9.6%	13	122	135

Pass rate is the share of proofs that compiled against the fixed specification. Total is the number of tasks, out of 135, for which the model returned a parseable proof. An asterisk* marks models evaluated on only part of the 135 tasks.

Scaling Law

We re-ran MiniMax M3 on 50 tasks solved by the frontier model with steadily larger effort budgets. Its solve rate climbs from 12% to 96%, approaching a near-perfect rate as effort grows.

Estimation: S(E) = 1 / (1 + (4.6M / E)^1.19)

About the benchmark

The benchmark was developed with support from the Ethereum Foundation as part of our work to make formal verification of smart contracts more practical. Instead of asking a model to explain whether a contract is safe, Verity asks it to produce a mathematical proof that a specific formal property follows from a specific contract model and specification.

It contains 135 proof tasks derived from 11 Ethereum contracts and built with input from protocol teams. Each task includes a reference proof and a frozen grading setup. The model may attempt the proof, but it may not change the theorem statement or specification.

A proof is considered correct only if it is accepted by the Lean 4 kernel, with no additional assumptions.

Methodology & datasetHow cases are built and graded View source on GitHubTasks, specs & grading harness

Every task, specification, and grading harness is open source. Results are fully reproducible.

In collaboration with