Optimizing code performance is paramount in software engineering, yet it remains a largely unexplored frontier for Large Language Models (LLMs). While models excel at fixing bugs, their ability to make code faster at repository scale is not well understood.
To address this, we introduce SWE-Perf, the first benchmark meticulously designed to evaluate LLMs on performance optimization tasks within genuine, complex repository contexts. Unlike benchmarks that focus on isolated code snippets, SWE-Perf challenges models to understand and modify entire codebases. The benchmark comprises 140 instances, each derived from a real performance-improving pull request on a popular GitHub repository. For each instance, a model is provided with the full source code, a specific performance-related test, and the human expert's solution for reference. The core task is to generate a code patch that reduces the test's execution time without introducing bugs.
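Concretely, one can picture an instance as a record like the following; the field names are illustrative assumptions, not SWE-Perf's released schema.

```python
# Hypothetical shape of one SWE-Perf instance; the field names are
# illustrative assumptions, not the benchmark's released schema.
instance = {
    "repo": "scikit-learn/scikit-learn",   # source GitHub repository
    "base_commit": "<sha>",                # codebase snapshot the model optimizes
    "perf_test": "<pytest node id>",       # test whose runtime is measured
    "expert_patch": "<PR diff>",           # human expert's solution, for reference
}
```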
Models receive a codebase and performance tests. Success is measured by the runtime gain of the generated patch.
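As a minimal sketch of this metric, assuming median-of-several-runs timing (the paper defines its own measurement procedure, which may differ), the gain is the relative reduction in the test's wall-clock time:

```python
import statistics
import subprocess
import time

def median_runtime(cmd: list[str], repeats: int = 5) -> float:
    """Time a test command `repeats` times; return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Hypothetical usage: measure before and after applying the patch.
baseline = median_runtime(["pytest", "tests/test_perf.py::test_case"])
# ... apply the model-generated patch to the working tree here ...
patched = median_runtime(["pytest", "tests/test_perf.py::test_case"])
gain = (baseline - patched) / baseline   # 0.1085 would be a 10.85% gain
print(f"runtime gain: {gain:.2%}")
```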
We built a rigorous pipeline to mine and validate 140 instances from projects like `scikit-learn` and `sympy`.
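A simplified, hypothetical view of the validation step: a candidate pull request is kept only if its performance test still passes and shows a reproducible speedup. The `pytest` invocation, the 5% threshold, and the helper names below are our assumptions; the actual SWE-Perf pipeline involves more filtering and repeated statistical measurement.

```python
import statistics
import subprocess
import time

def _git(repo_dir: str, *args: str) -> None:
    subprocess.run(["git", "-C", repo_dir, *args], check=True, capture_output=True)

def _time_test(repo_dir: str, test_id: str, repeats: int = 3) -> tuple[float, bool]:
    """Median wall-clock time of one pytest test, plus whether it passed."""
    samples, passed = [], True
    for _ in range(repeats):
        start = time.perf_counter()
        result = subprocess.run(["pytest", test_id], cwd=repo_dir, capture_output=True)
        samples.append(time.perf_counter() - start)
        passed = passed and result.returncode == 0
    return statistics.median(samples), passed

def validate_candidate(repo_dir: str, base_commit: str, patch_file: str,
                       test_id: str, min_gain: float = 0.05) -> bool:
    """Keep a PR as a benchmark instance only if it yields a stable,
    correctness-preserving speedup on its performance test."""
    _git(repo_dir, "checkout", base_commit)   # code before the PR
    base_time, base_ok = _time_test(repo_dir, test_id)
    _git(repo_dir, "apply", patch_file)       # code after the PR (absolute path)
    new_time, new_ok = _time_test(repo_dir, test_id)
    if not (base_ok and new_ok):              # the patch must not break the test
        return False
    gain = (base_time - new_time) / base_time
    return gain >= min_gain                   # require a measurable speedup
```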
Our evaluation of leading models (Claude 3, DeepSeek) and agentic frameworks reveals a clear capability gap: current models lack the high-level, architectural reasoning that human experts bring to performance work.
- **LLMs**: Performance gain of 2.26%. Strategy: micro-optimizations, i.e., low-level, local changes around calls such as `append`, `get`, and `set` (see the sketch after this list).
- **Human experts**: Performance gain of 10.85%. Strategy: macro-optimizations, i.e., high-level architectural changes involving components such as `parser`, `state`, and `listener`.
- **Performance gap**: Closing it requires AI to learn architectural thinking, not just local tweaking.
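To make the micro/macro distinction concrete, here is a hypothetical pair of optimizations in the spirit of each strategy; neither example is drawn from the benchmark itself.

```python
from functools import lru_cache

# Micro-optimization (model-style): one local, low-level change.
# Before: O(n) membership checks against a list inside a comprehension.
def filter_known_slow(items: list[str], known: list[str]) -> list[str]:
    return [x for x in items if x in known]

# After: convert the lookup structure to a set; the output is identical.
def filter_known_fast(items: list[str], known: list[str]) -> list[str]:
    known_set = set(known)                 # O(1) membership instead of O(n)
    return [x for x in items if x in known_set]

# Macro-optimization (expert-style): an architectural change, e.g. caching
# parse results so parser, state, and listener components stop re-parsing
# the same source on every call. `parse` is a stand-in for a real,
# expensive parser.
@lru_cache(maxsize=128)
def parse(source: str) -> tuple[str, ...]:
    return tuple(source.split())
```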
SWE-Perf introduces a challenging new frontier for LLMs: real-world code performance optimization. Our findings highlight a significant gap between current AI capabilities and human expertise, primarily due to a lack of architectural reasoning in models. By open-sourcing our benchmark, we aim to spur research that closes this gap and pushes models toward generating truly production-ready, performant code.
```bibtex
@article{he2025sweperf,
  title={SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?},
  author={He, Xinyi and Liu, Qian and Du, Mingzhe and Yan, Lin and Fan, Zhijie and Huang, Yiming and Yuan, Zejian and Ma, Zejun},
  year={2025}
}
```