SWE-Perf Logo: A llama on a rocket next to the text 'SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?'
1Xi'an Jiaotong University 2TikTok 3National University of Singapore 4University of California San Diego

What is SWE-Perf?

Optimizing code performance is paramount in software engineering, yet it remains a largely unexplored frontier for Large Language Models (LLMs). While models excel at fixing bugs, their ability to make code faster at repository scale is not well understood.

To address this, we introduce SWE-Perf, the first benchmark meticulously designed to evaluate LLMs on performance optimization tasks within genuine, complex repository contexts. Unlike benchmarks that focus on isolated code snippets, SWE-Perf challenges models to understand and modify entire codebases. The benchmark comprises 140 instances, each derived from a real performance-improving pull request on a popular GitHub repository. For each instance, a model is provided with the full source code, a specific performance-related test, and the human expert's solution for reference. The core task is to generate a code patch that reduces the test's execution time without introducing bugs.
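To make the setup concrete, one benchmark instance can be pictured as a small record pairing a repository snapshot with the tests to speed up. This is an illustrative sketch only; the field names and values below are hypothetical and are not the actual SWE-Perf schema.

```python
from dataclasses import dataclass

@dataclass
class PerfInstance:
    """Illustrative shape of one benchmark instance (field names are hypothetical)."""
    repo: str             # e.g. "scikit-learn/scikit-learn"
    base_commit: str      # commit the model's patch is applied to
    perf_tests: list[str] # tests whose runtime should drop
    expert_patch: str     # the human PR's diff, kept as a reference

# A model is given the checked-out repository at `base_commit` plus `perf_tests`,
# and must emit a patch that makes those tests run faster while the
# repository's correctness tests still pass.
example = PerfInstance(
    repo="scikit-learn/scikit-learn",
    base_commit="<sha>",
    perf_tests=["sklearn/tests/test_foo.py::test_bar"],
    expert_patch="<unified diff from the original PR>",
)
```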

The Task

SWE-Perf Task Workflow

Models receive a codebase and performance tests. Success is measured by the runtime gain of the generated patch.
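A minimal sketch of how runtime gain could be scored, assuming a hypothetical helper `run_perf_test` that times one pytest invocation; the real SWE-Perf harness additionally enforces correctness and uses more careful measurement than shown here.

```python
import statistics
import subprocess
import time

def run_perf_test(test_id: str, repo_dir: str) -> float:
    """Time a single pytest run (hypothetical helper; the real harness differs)."""
    start = time.perf_counter()
    subprocess.run(["python", "-m", "pytest", test_id, "-q"], cwd=repo_dir, check=True)
    return time.perf_counter() - start

def runtime_gain(test_id: str, repo_before: str, repo_after: str, repeats: int = 5) -> float:
    """Relative speedup of the patched repo over the original (0.10 == 10% faster)."""
    before = statistics.median(run_perf_test(test_id, repo_before) for _ in range(repeats))
    after = statistics.median(run_perf_test(test_id, repo_after) for _ in range(repeats))
    return (before - after) / before
```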

The Data

SWE-Perf Data Collection Pipeline

We built a rigorous pipeline to mine and validate 140 instances from projects like `scikit-learn` and `sympy`.
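One way to picture the validation step: after candidate performance-improving pull requests are mined, an instance is kept only if the speedup is reproducible across repeated timings. The threshold, repeat count, and filter logic below are illustrative assumptions, not the paper's exact criteria.

```python
import statistics

def keep_instance(times_before: list[float], times_after: list[float],
                  min_gain: float = 0.05) -> bool:
    """Keep a mined PR as a benchmark instance only if the measured speedup is
    consistent across runs and exceeds a minimum relative gain.
    (Illustrative filter; SWE-Perf's actual pipeline is stricter.)"""
    before = statistics.median(times_before)
    after = statistics.median(times_after)
    gain = (before - after) / before
    # Require every patched run to beat the slowest unpatched run,
    # so measurement noise alone cannot explain the improvement.
    strictly_faster = max(times_after) < min(times_before)
    return gain >= min_gain and strictly_faster

# Example: three timed runs at the PR's parent commit vs. its merge commit.
print(keep_instance([1.20, 1.18, 1.22], [1.01, 0.99, 1.03]))  # True
```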

Key Finding: A Massive AI-Human Gap

Our evaluation of top models (Claude 3, DeepSeek) and agentic frameworks reveals that AI currently lacks the high-level reasoning of human experts.

Best AI Model

AI Model Word Cloud

Performance Gain: 2.26%

Strategy: Micro-optimizations. Focuses on low-level, local changes like `append`, `get`, and `set`.

Human Expert

Human Expert Word Cloud

Performance Gain: 10.85%

Strategy: Macro-optimizations. Performs high-level architectural changes involving components like `parser`, `state`, and `listener`.

Performance Gap: 8.59%. AI must learn architectural, repository-level thinking to close it.
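To make the contrast behind the word clouds concrete, here is a hand-written illustration (not drawn from the benchmark data): a typical model patch tweaks a local loop, while a typical expert patch restructures the work so an expensive component, such as a parser, is built once and reused.

```python
import re

# --- Micro-optimization (typical of model patches): a local, line-level change ---
def collect_even_slow(values):
    out = []
    for v in values:
        if v % 2 == 0:
            out.append(v)  # repeated .append in a Python-level loop
    return out

def collect_even_fast(values):
    return [v for v in values if v % 2 == 0]  # same logic, one comprehension

# --- Macro-optimization (typical of expert PRs): restructure how work is shared ---
def count_matches_slow(pattern, lines):
    # Recompiles the pattern for every line, on every call.
    return sum(bool(re.compile(pattern).search(line)) for line in lines)

_COMPILED = {}

def count_matches_fast(pattern, lines):
    # Hoist the expensive construction out of the hot path and cache it,
    # analogous to reusing a parser or state object across calls.
    rx = _COMPILED.setdefault(pattern, re.compile(pattern))
    return sum(bool(rx.search(line)) for line in lines)
```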

Conclusion

SWE-Perf introduces a challenging new frontier for LLMs: real-world code performance optimization. Our findings highlight a significant gap between current AI capabilities and human expertise, primarily due to a lack of architectural reasoning in models. By open-sourcing our benchmark, we aim to spur research that closes this gap and pushes models toward generating truly production-ready, performant code.

BibTeX

@article{he2025sweperf,
    title={SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?},
    author={He, Xinyi and Liu, Qian and Du, Mingzhe and Yan, Lin and Fan, Zhijie and Huang, Yiming and Yuan, Zejian and Ma, Zejun},
    year={2025}
}