OpenAI's GPT-5.3-Codex tops benchmarks, claims it helped build itself

OpenAI released GPT-5.3-Codex on Wednesday, minutes before Anthropic launched Claude Opus 4.6. The new model scores 57% on SWE-Bench Pro and runs 25% faster than its predecessor. OpenAI says early versions helped debug their own training runs.

OpenAI released GPT-5.3-Codex on Wednesday, timing the announcement within minutes of Anthropic's Claude Opus 4.6 launch. The synchronized releases mark an escalation in the AI coding competition, with both companies also running competing Super Bowl ads this Sunday.

The new model posts 57% on SWE-Bench Pro, a contamination-resistant benchmark spanning four languages. It scores 77.3% on Terminal-Bench 2.0, versus Claude Opus 4.6's reported 65.4%. On OSWorld, which tests visual desktop task completion, it achieves 64%.

The performance comes with efficiency gains: inference is 25% faster, and equivalent tasks use half the tokens of GPT-5.2-Codex. Codex usage has doubled since mid-December, with over 1 million developers using it last month, and rate limits for paid plans doubled on February 2.

The notable claim: early versions of GPT-5.3-Codex helped build the model itself. The team used them to debug training runs, manage deployment infrastructure, and diagnose test results. "Our first model that was instrumental in creating itself," according to OpenAI.

Beyond coding

OpenAI positions GPT-5.3-Codex beyond pure code generation. The model handles debugging, deployment, product requirements, slide decks, and spreadsheet analysis. This expansion targets the broader enterprise productivity market, where Microsoft, Salesforce, and ServiceNow are embedding AI agents.

The model is OpenAI's first classified as "High capability" for cybersecurity under its Preparedness Framework. It is trained to identify vulnerabilities, though OpenAI says it has no evidence the model can automate end-to-end attacks.

Enterprise considerations

For CTOs evaluating coding AI, three patterns emerge from community discussion:

Pricing: API access is coming soon; current ChatGPT paid plans include access via the app, CLI, IDE, and web. Claude pricing remains token-based with different rate structures.

Production deployment: GPT-5.3-Codex supports parallel agent workflows via git worktrees, which avoid merge conflicts by giving each agent its own checkout. Cold-start latency improvements are claimed but unverified in production at scale.

Migration paths: Teams moving between platforms report integration overhead varies by codebase size and language. TypeScript projects show strong performance across both platforms.
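The git-worktree pattern behind those parallel agent workflows can be sketched in a few shell commands. This is a minimal illustration, not OpenAI's tooling: the repo location, branch names, and directory names below are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical setup: a throwaway repo standing in for a real project.
repo=$(mktemp -d)
cd "$repo"
git init --quiet .
git -c user.email=ci@example.com -c user.name=ci \
    commit --quiet --allow-empty -m "initial commit"

# One worktree per agent task: each gets its own branch and its own
# working directory, so parallel edits never race on the same checkout
# and changes merge back only when a task is finished.
git worktree add -b agent/fix-auth  ../agent-fix-auth
git worktree add -b agent/add-tests ../agent-add-tests

# Show all checkouts and the branch each one has out.
git worktree list
```

When a task completes, `git worktree remove ../agent-fix-auth` deletes that checkout while keeping its branch available to merge.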

DeepSeek V4, arriving mid-February with 1M+ context window and open weights, adds another variable. The real test: which model ships working code at acceptable cost when rate limits and latency matter.