SorryDB: Can AI Provers Complete Real-World Lean Theorems?

SorryDB is a dynamically updating benchmark built from 78 real-world GitHub projects to evaluate AI systems on completing formal proofs in the Lean theorem prover. It addresses test-set contamination and ecological validity by sourcing tasks from active repositories; baseline evaluations show that an agentic approach using a model like Gemini Flash performs best. The benchmark aims to produce AI tools genuinely useful for mathematicians by focusing on complex, dependency-rich proofs rather than isolated puzzles.

SorryDB: A Dynamic Benchmark for AI-Driven Mathematical Formalization

Researchers have introduced SorryDB, a novel, dynamically updating benchmark designed to rigorously evaluate AI systems on the task of formalizing mathematical proofs in the Lean theorem prover. Unlike static benchmarks composed of competition-style problems, SorryDB is built from 78 real-world, open-source formalization projects on GitHub, creating a continuous stream of authentic tasks that reflect the actual needs of the mathematical community. This approach aims to produce AI tools that are genuinely useful for mathematicians by focusing on complex, dependency-rich proofs rather than isolated puzzles.

Aligning AI Evaluation with Real-World Mathematical Work

The core innovation of SorryDB lies in its dynamic nature and its grounding in community-driven projects. By sourcing its tasks—specifically "sorry" placeholders that need proof completion—from active GitHub repositories, the benchmark ensures that progress on it translates directly to an AI's ability to contribute to novel, ongoing research. This methodology directly addresses critical issues in AI evaluation for formal mathematics, such as test-set contamination from models trained on public data and the misalignment of static benchmarks with practical usability.
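In Lean, the `sorry` keyword closes any goal without a proof, marking the statement as accepted-but-unproven; a SorryDB task asks an AI system to replace such a placeholder with a genuine proof. A minimal illustration (the lemma here is illustrative, not drawn from the benchmark):

```lean
-- A statement a project author has left incomplete with `sorry`:
theorem add_comm_example (a b : Nat) : a + b = b + a := sorry

-- The completed proof an AI prover would be asked to produce:
theorem add_comm_example' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Real benchmark tasks are harder than this: they sit inside large projects, so a valid proof typically must invoke project-specific definitions and lemmas, not just standard-library facts.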

From an AI-for-science perspective, this represents a significant step toward ecological validity in benchmarking. Evaluating models on curated competition problems can lead to overfitting and to tools that fail on the messy, interconnected proofs found in real research. SorryDB's continuously updated task stream provides a more robust and future-proof measure of an AI agent's capacity to act as a collaborative partner in mathematical discovery.

Benchmarking Current AI and Symbolic Approaches

To establish a performance baseline, the researchers evaluated a diverse array of methods on a snapshot of 1,000 tasks from SorryDB. The tested approaches included generalist large language models (LLMs), agentic systems that can perform multi-step reasoning, specialized symbolic provers, and even a simple curated list of common Lean tactics. This comprehensive evaluation reveals the current state of the field and the complementary strengths of different paradigms.

The results showed that an agentic approach utilizing Google's Gemini Flash model was the most performant overall. However, it was not strictly dominant; other off-the-shelf LLMs, specialized provers, and the basic tactic list all succeeded on distinct subsets of problems. This suggests that the ideal future system may be a hybrid, leveraging the strategic planning of agents, the broad knowledge of LLMs, and the precise, reliable reasoning of symbolic methods.
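To make the "curated tactic list" baseline concrete: such a baseline simply fires a fixed sequence of common finishing tactics at each goal and keeps whichever one succeeds. A hedged sketch of the kinds of goals each tactic closes (assuming Mathlib is available; the specific tactics in SorryDB's list are not detailed here):

```lean
import Mathlib.Tactic

-- Each line shows a goal that one stock tactic can close on its own:
example (a b : Nat) : a + b = b + a := by omega        -- linear arithmetic
example (x : Nat) : x * 1 = x := by simp               -- simp-lemma rewriting
example (a b : Int) : (a + b)^2 = a^2 + 2*a*b + b^2 := by ring  -- ring normalization
```

That such a cheap strategy still solves problems no LLM or symbolic prover solved is part of the evidence for a hybrid, multi-strategy system.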

Why This Matters for AI and Mathematics

  • Bridges the Usability Gap: SorryDB pushes AI development toward creating tools that are aligned with practicing mathematicians' workflows, moving beyond solving artificial challenges to assisting with genuine research bottlenecks.
  • Mitigates Benchmarking Pitfalls: Its dynamic, community-sourced design helps prevent test-set contamination, ensuring evaluations measure true generalization and learning capability rather than data memorization.
  • Highlights Complementary Techniques: The finding that agentic, LLM, and symbolic methods are complementary underscores that the path to advanced AI for formal math will likely involve integrated, multi-strategy systems.
  • Accelerates Formal Verification: By providing a robust evaluation framework, SorryDB can accelerate progress in automated proof assistance, which is critical for verifying complex mathematical theories and software correctness.
