GPT-4.1 is OpenAI's flagship language model optimized for coding, instruction following, and long-context reasoning, released in April 2025. It supports a 1-million-token context window, over 8× the capacity of GPT-4o, and scores 54.6% on SWE-bench Verified, a major improvement over GPT-4o on real-world software engineering tasks. The model excels at precise code diffs, agent reliability, and high recall across large document contexts, making it well suited for IDE tooling, automated coding agents, and enterprise knowledge retrieval.
API | Vision, Web Search, File | Proprietary Model
Knowledge Cutoff: 2024-06-30
Input → Output Format
Context Memory: 1.0M tokens in, 33K tokens out
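As a rough illustration of how the context figures above might be used, the sketch below checks whether a prompt and a requested completion length fit within the listed limits. The constants are the rounded values shown on this page, the 4-characters-per-token ratio is a common rule of thumb rather than the model's actual tokenizer, and the function names are hypothetical. It also assumes the input and output limits apply independently, which the page does not state.

```python
# Limits as listed on this page (rounded; exact API limits may differ).
MAX_INPUT_TOKENS = 1_000_000   # "1.0M IN"
MAX_OUTPUT_TOKENS = 33_000     # "33K OUT"

# Assumption: about 4 characters per token for English text. This is a
# heuristic only; a real tokenizer is needed for accurate counts.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_budget(prompt: str, requested_output_tokens: int) -> bool:
    """Return True if both the prompt and the requested output fit the limits."""
    return (
        estimate_tokens(prompt) <= MAX_INPUT_TOKENS
        and requested_output_tokens <= MAX_OUTPUT_TOKENS
    )
```

For production use, swapping `estimate_tokens` for a real tokenizer count would give a dependable answer near the limit.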
AI Performance Evaluation
Arena Overall Score: 1312 ±4 (as of 2026-05-01)
Overall Rank: No. 216
Votes: 100,105
Arena by Ability
Hard Prompts: 1311 ±6 (No. 222)
Expert Knowledge: 1286 ±12 (No. 215)
Instruction Following: 1294 ±6 (No. 213)
Conversation Memory: 1298 ±8 (No. 215)
Creative: 1285 ±8 (No. 203)
Coding: 1338 ±7 (No. 223)
Math: 1303 ±8 (No. 192)
Arena by Occupation
Creative Writing: 1306 ±6 (No. 197)
Social Sciences: 1321 ±8 (No. 220)
Media: 1290 ±8 (No. 191)
Business: 1282 ±9 (No. 235)
Healthcare: 1305 ±12 (No. 220)
Legal: 1317 ±11 (No. 223)
Software: 1324 ±6 (No. 230)
Mathematics: 1308 ±8 (No. 194)
Source: Arena Intelligence
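To give the Arena scores some intuition: arena-style leaderboards usually report ratings on an Elo-like scale, where a rating gap maps to an expected head-to-head win rate. The page does not state its rating method, so the sketch below simply assumes the conventional logistic formula with a 400-point scale. The opponent rating is a made-up value for illustration.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected probability that A is preferred over B on an Elo-like scale.

    Assumes the conventional base-10 logistic curve with a 400-point scale;
    this page does not confirm that its scores follow that convention.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


GPT_41_OVERALL = 1312          # Arena Overall Score from this page
HYPOTHETICAL_OPPONENT = 1412   # illustrative value, not a real model's score

p = expected_win_rate(GPT_41_OVERALL, HYPOTHETICAL_OPPONENT)
print(f"Expected win rate against a model rated 100 points higher: {p:.2f}")
```

Under these assumptions, a 100-point deficit corresponds to winning roughly 36% of head-to-head votes, and equal ratings correspond to 50%.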
Overall
AA Intelligence Index: 26% (↓13%)
ForecastBench: 59% (↑0%)
Reasoning & Math
AA Math Index: 35% (↓40%)
GPQA Diamond: 67% (↓16%)
HLE: 4.6% (↓13%)
MMLU-Pro: 81% (↓1%)
AIME 2025: 35% (↓40%)
MATH-500: 91% (↓2%)
Coding
AA Coding Index: 22% (↓15%)
LiveCodeBench: 46% (↓20%)
TAU2: 47% (↓33%)
TerminalBench: 14% (↓20%)
SciCode: 38% (↓4%)
Language & Instructions
IFBench: 43% (↓20%)
AA-LCR: 61% (↓1%)
Hallucination (HHEM): 5.6% (↓5%)
Factual (HHEM): 94% (↑4%)
Output Speed
Standard Mode: 111 tok/s (↑34)
First Output: 0.57 s
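The two speed figures above can be combined into a back-of-the-envelope estimate of how long a response takes. This sketch assumes "First Output" means time to first token and that the throughput figure is a steady generation rate after that point. Real latency varies with load, prompt length, and network conditions, and the function name is hypothetical.

```python
FIRST_OUTPUT_SECONDS = 0.57   # "First Output" from this page
TOKENS_PER_SECOND = 111.0     # "Standard Mode" output speed from this page


def estimate_response_seconds(output_tokens: int) -> float:
    """Approximate wall-clock time to receive a full response.

    Model: fixed time to first token plus output_tokens at a constant rate.
    This is a simplification, not a measured latency profile.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return FIRST_OUTPUT_SECONDS + output_tokens / TOKENS_PER_SECOND


for n in (100, 500, 2000):
    print(f"{n:>5} tokens: about {estimate_response_seconds(n):.1f} s")
```

Under this model, a 500-token answer takes about 5 seconds, with generation time rather than first-token latency dominating for anything beyond a few dozen tokens.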