On February 7, 2026, OpenAI officially released GPT-5.3-Codex via its blog and its account on X, billing it as the "most powerful agentic programming model to date". The update marks Codex's transition from a simple code-completion assistant to an all-in-one development agent, and it delivers a substantial performance gain: inference is roughly 25% faster, and the model sets new industry records on key benchmarks such as SWE-Bench Pro and Terminal-Bench.
Notably, GPT-5.3-Codex is the first model in OpenAI's history to be rated "High capability" in cybersecurity under the Preparedness Framework. The rating reflects the model's potential in vulnerability discovery and defense, and it has prompted OpenAI to adopt stricter tiered deployment and governance measures. The release lands amid intense competition in AI programming, a moment industry outlet VentureBeat has dubbed a new wave of "AI Coding Wars". It benchmarks directly against rival models released in the same window, signaling that 2026 will be the year agentic programming goes fully commercial.
The core upgrades in GPT-5.3-Codex center on agent capability, execution efficiency, and security boundaries. Here are the 8 key points from this release:
·Positioning upgrade: evolved from a "code assistant" into an agent capable of autonomous planning, long-horizon task execution, and computer operation.
·Speed improvement: thanks to infrastructure and inference-stack optimization, the model runs about 25% faster than the previous generation, noticeably reducing interactive latency.
·New SOTA: achieved 56.8% accuracy on SWE-Bench Pro, which measures real-world software engineering ability, setting a new industry record.
·Qualitative change in terminal operation: scored 77.3% on Terminal-Bench 2.0, a significant improvement over the previous-generation GPT-5.2-Codex (64.0%), greatly strengthening command-line skills.
·Desktop control: scored 64.7% on the OSWorld-Verified benchmark, far ahead of the previous generation's 38.2%, demonstrating its potential in visual desktop environments (per the System Card).
·Security-rating breakthrough: the first model rated "High capability" in cybersecurity, with strong vulnerability-detection skills, accompanied by the launch of a $10M defense fund.
·Self-evolution: this is OpenAI's first model to "participate in its own creation"; early versions were used to debug training runs, manage deployments, and diagnose test results.
·Distribution: available starting today to ChatGPT Plus and higher-tier paying users, across the Codex app, CLI, IDE plugins, and the web.
OpenAI positions GPT-5.3-Codex as its "strongest agentic programming model", with the core keyword being "agentic". Unlike the traditional single-turn interaction of "user enters prompt -> model outputs code", agentic coding means the model runs a "metacognitive" cycle much like a human engineer's: think -> act -> observe -> iterate. It no longer passively completes functions; it can actively take over long-horizon tasks.
In product terms, this capability shift means Codex can work independently, like a junior colleague. OpenAI's blog highlighted one representative scenario: self-accelerated development. The Codex team used early versions of GPT-5.3-Codex to debug its own training process. The model could identify rendering bugs in the infrastructure, analyze logs to find the root cause of low cache hit rates, and even dynamically scale GPU clusters during release to absorb traffic spikes.
For developers, this means a qualitative change in interaction. For example, when a user asks it to "build a web game", GPT-5.3-Codex will not just spit out a block of HTML; it will create a project structure, write the core logic, run tests, and, if an error occurs, autonomously read the error log (terminal output), analyze the cause, modify the code, and run again until the tests pass. Throughout, users can watch the model's "thinking process" and "task list" in real time through the Codex app's interface and intervene at any point, rather than waiting for a black-box final result.
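The think -> act -> observe -> iterate loop described above can be sketched as a minimal Python simulation. Everything here is illustrative: `propose_fix` and `run_tests` are toy stand-ins for the model and the test harness, not the real Codex interface; the point is only the control flow, where "terminal output" from a failed run is fed back into the next attempt.

```python
from typing import Callable, Tuple

def agent_loop(
    propose_fix: Callable[[str], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    max_iters: int = 5,
) -> Tuple[bool, int]:
    """Think -> act -> observe -> iterate until tests pass or the budget is spent."""
    error_log = ""
    for attempt in range(1, max_iters + 1):
        code = propose_fix(error_log)        # think + act: propose new code
        passed, error_log = run_tests(code)  # observe: read the "terminal output"
        if passed:
            return True, attempt             # stop as soon as tests are green
    return False, max_iters

# Toy stand-ins: the "model" only fixes the bug after seeing the error log.
def propose_fix(error_log: str) -> str:
    return "return a + b" if "TypeError" in error_log else "return a - b"

def run_tests(code: str) -> Tuple[bool, str]:
    if code == "return a + b":
        return True, ""
    return False, "TypeError: expected add(2, 3) == 5, got -1"

passed, attempts = agent_loop(propose_fix, run_tests)
print(passed, attempts)  # the loop succeeds on the second attempt
```

The essential design choice is that the error log is part of the agent's state: each iteration conditions on the previous failure rather than retrying blindly.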
To quantify GPT-5.3-Codex's engineering capabilities, OpenAI released a set of hard benchmark numbers. The data shows not only coding progress but also the model's "hands-on" ability to operate computer environments. A closer look at the core benchmarks:
| Benchmark | Capability measured | GPT-5.3-Codex score | vs. GPT-5.2-Codex |
| --- | --- | --- | --- |
| SWE-Bench Pro | Real software engineering (4 languages, contamination-resistant, simulates resolving real GitHub issues) | 56.8% | 56.4% (slight increase) |
| Terminal-Bench 2.0 | Command-line/terminal operation (shell commands, environment configuration, log analysis) | 77.3% | 64.0% (significant increase) |
| OSWorld-Verified | Visual operating-system interaction (GUI operation, multi-application collaboration, desktop workflows) | 64.7% | 38.2% (qualitative leap) |
| GDPval | Professional knowledge work (44 categories of professional tasks, e.g. spreadsheet analysis and slide production) | 70.9% | (no direct comparison; roughly on par with GPT-5.2) |
·The weight of SWE-Bench Pro: OpenAI emphasizes that SWE-Bench Pro is more challenging than the standard SWE-Bench Verified. It is no longer limited to Python, covering four mainstream programming languages, and it is designed with stronger anti-contamination mechanisms. A score of 56.8% means the model performs at roughly the level of a mid-to-senior engineer when solving complex, unseen real-world software issues.
·The evolution of terminal and OS skills: the biggest highlight of this update is the jump on Terminal-Bench (77.3%) and OSWorld (64.7%). GPT-5.3-Codex is no longer a geek who can only type in an IDE, but a versatile operator who can wield the Linux command line and drive a visual desktop UI like a human. This capability is the key foundation for delivering on the "agentic" promise.
·Fewer tokens, faster speed: beyond accuracy, efficiency is a core selling point. Official data shows GPT-5.3-Codex consumes significantly fewer tokens than its predecessors on the same tasks, meaning it "thinks more clearly and acts more directly", cutting out wasteful trial-and-error loops. Combined with the 25% inference speedup, overall project time drops noticeably.
GPT-5.3-Codex's agent architecture diagram: shows how the main agent decomposes tasks and dispatches them to specialized sub-agents (Subagent #1, #2), finally aggregating their context to complete a complex workflow.
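The orchestration pattern in the diagram (a main agent decomposing a task, fanning it out to sub-agents, and merging their results back into one context) can be sketched roughly as follows. All names here (`plan`, `run_subagent`, `main_agent`) are hypothetical stubs for illustration, not OpenAI's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Subtask:
    name: str
    description: str

def plan(task: str) -> list[Subtask]:
    """Main agent: decompose the task (a fixed toy plan here)."""
    return [
        Subtask("subagent-1", f"implement: {task}"),
        Subtask("subagent-2", f"write tests for: {task}"),
    ]

def run_subagent(sub: Subtask) -> str:
    """Sub-agent: work on its slice in an isolated context (stubbed)."""
    return f"[{sub.name}] done: {sub.description}"

def main_agent(task: str) -> str:
    """Plan, fan out to sub-agents in parallel, then aggregate their results."""
    subtasks = plan(task)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_subagent, subtasks))
    # Aggregation: merge each sub-agent's summary back into one shared context
    return "\n".join(results)

print(main_agent("add a /health endpoint"))
```

The key property this sketch captures is isolation plus aggregation: each sub-agent sees only its own subtask, and only their summaries flow back to the main agent's context.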
As model capability leaps, the security boundary expands with it. GPT-5.3-Codex brings a milestone signal: it is the first model in OpenAI's history to be rated "High capability" in cybersecurity.
According to OpenAI's System Card, the model scored 77.6% on CTF (Capture The Flag) cybersecurity challenges (versus 67.4% for the previous generation). OpenAI's Preparedness Framework is a four-tier scale (Low, Medium, High, Critical) for assessing the potential risks of frontier models. A "High" rating means the model has expert-level capability in discovering software vulnerabilities, writing exploit code, and running network attack-and-defense simulations.
This "double-edged sword" capability has prompted heightened vigilance and targeted governance at OpenAI:
·Defense-first governance: OpenAI states that while there is no conclusive evidence the model can "automate end-to-end network attacks", it has preemptively deployed its most comprehensive cyber safety stack to date, including targeted safety training, automated threat monitoring, and enforcement pipelines driven by threat intelligence.
·Trusted Access pilot: to ensure "High capability" serves defenders rather than attackers, OpenAI has launched a "Trusted Access for Cyber" pilot. Advanced features involving sensitive cybersecurity operations may not be open to all API users, instead requiring access by vetted, trusted researchers or enterprises.
·$10M defense fund: OpenAI has pledged $10 million in API credits specifically for projects that use GPT-5.3-Codex for cyber-defense research, particularly the protection of open-source software and critical infrastructure. The aim is to make defenders evolve faster than attackers through asymmetric resource investment.
Schematic of OpenAI's Preparedness Framework: as GPT-5.3-Codex crosses the "High capability" line, the cybersecurity defense system is upgraded in step, covering multi-layer mechanisms from data protection to active threat blocking.
The release of GPT-5.3-Codex is not an isolated event. According to VentureBeat, it landed in the same week, even the same day, as rival Anthropic's release of Claude Opus 4.6. This confrontational cadence is what industry observers call the white-hot phase of the "AI Coding Wars".
In this contest, the focus has shifted from simple benchmark scores to a deeper competition over productization and ecological niche:
·Head-to-head on agent capability: Anthropic's Claude Opus 4.6 also emphasizes long context and agentic planning, and it has performed well in some community evaluations (reportedly around 65.4% on Terminal-Bench). OpenAI, for its part, stakes its claim on practical efficiency, with a 77.3% Terminal-Bench score and a 25% speed gain. Sam Altman put it bluntly on social media: "This era belongs to Builders."
·Diverging platform strategies: OpenAI is building a closed, tightly integrated development ecosystem through the Codex app (desktop), IDE plugins, and APIs, aiming to become an "operating system" for enterprise development. Competitors instead integrate into third-party tools such as Cursor and VS Code via APIs. The release of GPT-5.3-Codex further consolidates OpenAI's "first-party terminal" moat.
·The fight for the enterprise market: recent data shows enterprise spending on LLMs growing at a startling rate, while OpenAI's market share is being squeezed by Anthropic and Google. Launching GPT-5.3-Codex now is not just a technical demonstration; it is a statement to enterprise customers of its irreplaceability in handling complex tasks and in security.