News

Shortly After Debut, GPT-5.3-codex Dominates Microsoft Foundry AI Model Leaderboard

OpenAI's GPT-5.3-codex vaulted to the top of the Microsoft Foundry AI Model Leaderboard shortly after its debut, showcasing its superior performance across a range of benchmarks and placing No. 1 on perhaps the most important metric: quality.

The Microsoft Foundry AI Model Leaderboard evaluates various AI models based on multiple criteria, including quality, speed, safety, cost and throughput.

Microsoft Foundry Leaderboard
[Click on image for larger view.]Microsoft Foundry Leaderboard (source: Microsoft).

GPT-5.3-Codex was released on Feb. 5 and on Feb. 24 appeared on Microsoft Foundry, an enterprise AI platform that provides a governed, unified environment for discovering, comparing, deploying, and scaling advanced models with built-in compliance, cost controls, and performance monitoring. So Foundry-using devs might have to wait a while to get their hands on the brand-new GPT-5.4 and its Codex variant.

The leaderboard's Quality rating is described as "An average of accuracy scores from comprehensive benchmark datasets measuring model capabilities, such as reasoning, knowledge, and coding."

As can be seen in the image above, GPT-5.3-codex also ranked highly in the other categories, including speed, safety and cost.

I analyzed those rankings with the "podium" method, a simple scoring system that ranks models by awarding 3 points for a 1st-place finish, 2 points for 2nd, and 1 point for 3rd in each leaderboard category, then summing those points across metrics (such as Quality, Safety, Cost, and Throughput) to determine the overall top performers. As you can see, GPT-5.3-codex only edged out other top Quality models by a slim margin 0.94 to 0.93. My podium-style analysis of the top five models across all four benchmarks also shows a tie, broken by the Medal-Lexicographic Cascade (podium tiebreak) technique.

So ultimately, in that analysis, GPT-5.3-codex was edged out by GPT-5-nano.

Rank Model Points Medal Vector (G, S, B) Primary Differentiator
1 gpt-5-nano 5 (1, 1, 0) Secured top efficiency honors by leading the Cost pillar ($0.14) and placing second in Throughput (223.53 TPS).
2 gpt-5.3-codex 3 (1, 0, 0) Wins tiebreak on universal quality; holds the highest Quality Index (0.94) and strong Coding performance.
3 gpt-5-pro 3 (1, 0, 0) Achieved a perfect Safety score (0.00% Attack Success Rate), yielding a 0.515 normalized win-margin over its nearest rival.
4 gpt-oss-120b 3 (1, 0, 0) Recognized as the speed leader in the catalog with a first-place Throughput finish of 249.31 TPS.
5 gpt-5.2-codex 3 (0, 1, 1) Demonstrated significant breadth as the only model to medal in both Quality (0.93) and Safety (0.18%) pillars.

Across all the leaderboard rankings, OpenAI's models dominated, perhaps due to its special relationship that started early with Microsoft, though both companies have expanded their relationships and integrations with other companies since then.

The non-OpenAI models making the grade include:

  • Claude Opus 4.6 (Anthropic)
  • Kimi-K2.5 (Kimi/Moonshot)
  • Claude-haiku-4-5 (Anthropic)
  • DeepSeek V3.1-mini (DeepSeek)
  • mistral-small-2503 (Mistral)
  • Phi-4-reasoning (Microsoft)
  • grok-code-fast-1 (xAI / Grok)
  • grok-3-mini (xAI / Grok)

The leaders of the three other rankings are:

  • Safety: GPT-5 Pro
  • Cost: GPT-5 Nano
  • Throughput: GPT-OSS-120B

The Foundry leaderboard lets users analyze the offerings in a variety of ways, including trade-off charts like Quality vs Cost ("You might need a model that excels in one category but is a lower performer in another."):

Microsoft Foundry Trade-Off, Quality vs Cost
[Click on image for larger view.]Microsoft Foundry Trade-Off, Quality vs Cost (source: Microsoft).

It also includes leaderboards by scenario:

Microsoft Foundry Scenario Leaderboard
[Click on image for larger view.]Microsoft Foundry Scenario Leaderboard (source: Microsoft).

Summarized results of that include:

Scenario Leaderboards -- Standout Performers (Top Slots & Ties)
Scenario Leader(s) Score / Metric Notes
Reasoning GPT-5.2-Codex 0.95 Clear #1 on the Reasoning index
Coding Claude Opus 4.6 0.83 Highest Coding index among listed models
General Knowledge GPT-5.2-Codex 0.94 Leads the category
Question Answering GPT-5.3-Codex (tie), GPT-5.2 (tie) 0.95 Two-way tie for #1
Math Claude Sonnet 4.5 (tie), Claude Opus 4.5 (tie), Kimi-K2.5 (tie), GPT-5.3-Codex (tie), GPT-5.2-Codex (tie) 0.99 Five-way tie for #1
Groundedness Claude Opus 4.6 (tie), Claude Sonnet 4.5 (tie) 0.98 Dual leaders
Knowledge in Sensitive Domains Grok-4 (tie), GPT-5.3-Codex (tie) 0.87 Shared top score
Standard Harmful Behavior (Safety) Multiple at 0.00% 0.00% ASR Includes GPT-OSS-120B, GPT-5, GPT-5.3-Codex, GPT-5.2, GPT-5.1
Contextually Harmful Behavior (Safety) Multiple at 0.00% 0.00% ASR Includes O3-Pro, GPT-5.2-Codex, GPT-5-Pro, Claude Opus 4.5, Claude Haiku 4.5
Copyright Violations (Safety) Multiple at 0.00% 0.00% ASR Includes Phi-4-Reasoning, O4-Mini, O3, O3-Mini, O1
Toxicity Detection Multiple tied leaders 0.89 (F1-based) Includes GPT-5.3-Codex, Claude Opus 4.6, GPT-5.2-Codex, O1, GPT-5.2

If you're deciding which model to put in production, you might start by clarifying the job-to-be-done (chat, coding, retrieval, tool use, multimodal) and the guardrails you care about most (quality targets, safety thresholds, latency/SLA, and budget). Use the Leaderboard to shortlist candidates by Quality for your task, then scan the Scenario leaderboards (Reasoning, Coding, QA, Math, Groundedness) to confirm a model's strengths align with your workload. Next, open the trade-off charts (e.g., Quality vs Cost and Quality vs Throughput) to see whether a "frontier" option is worth the runtime cost -- or whether an efficiency model like a "Nano" class will meet the bar. Keep a Safety veto in place (e.g., exclude models with higher attack-success rates for your risk category), and favor models that appear on multiple podiums; when podium scores tie, apply a simple, transparent tiebreak (Medals → Coverage → Normalized Win-Margin) so the choice is deterministic.

Before committing, run a quick, apples-to-apples evaluation on your own samples: same prompts, same tools, same context window, measuring quality (task-specific metrics), speed (tokens/sec and TTFT), and cost ($/1M with your real input:output ratios). Try two finalists side-by-side in a small A/B: one "quality-first" and one "efficiency-first," add caching or provisioned throughput if you need predictable latency, and set a fallback (router or secondary model) for resilience. Finally, bake in observability -- capture outputs, costs, latencies, and safety events -- so you can iterate when volume, prompts, or governance needs change. In practice, this "shortlist → trade-offs → on-data eval → A/B → monitor" loop gets you to a confident pick fast, and it's easy to re-run when the leaderboard shifts.

About the Author

David Ramel is an editor and writer at Converge 360.

comments powered by Disqus

Featured

  • Microsoft Highlights Visual Studio Live! Event Lineup and Longtime Developer Community Role

    A Microsoft MVP Blog post on Visual Studio Live!'s longevity arrives as the 2026 conference series continues with upcoming stops at Microsoft HQ, San Diego and Orlando.

  • Using Local AI to Cut Copilot Usage-Based Billing Shock

    After being gobsmacked by the new billing plan using almost all my monthly credits in one or two days, I tried pushing some Copilot-style coding work onto local models in VS Code. What I found was less "free AI" and more "pick your pain": cloud charges on one side, heavy local resource use and long waits on the other.

  • .NET 11 Preview 5 Focuses on Performance, Productivity and Safer Code

    .NET 11 Preview 5 focuses on under-the-hood runtime performance gains, streamlined APIs and language features that reduce boilerplate, plus built‑in security checks and incremental ASP.NET Core and EF Core improvements aimed at everyday developer productivity.

  • VS Code 1.124 Focuses on Agent Autonomy and Parallel Sessions

    Microsoft's June 2026 VS Code update turns on Autopilot by default and adds background sending for agent sessions.

Subscribe on YouTube