OpenAI announces new o3 models


OpenAI has capped off its 12-day “shipmas” event with its most significant announcement yet.

On Friday, the company introduced o3, the successor to the o1 “reasoning” model launched earlier this year. Similar to o1, o3 isn’t a standalone model but a family, including o3-mini — a smaller, distilled version designed for specific tasks.

OpenAI has made a bold claim: under certain conditions, o3 edges closer to AGI (artificial general intelligence) than any of its predecessors — though this comes with several caveats, as explained below.

Why Skip o2?

The decision to name the new model o3, bypassing o2, reportedly stems from trademark concerns. According to The Information, OpenAI avoided potential conflicts with the British telecom provider O2. CEO Sam Altman alluded to this during a livestream earlier today, noting the oddity of such naming constraints.

Availability and Safety Concerns

While neither o3 nor o3-mini is publicly accessible yet, researchers focused on AI safety can sign up for early access to o3-mini starting today. OpenAI plans to roll out an o3-mini preview by the end of January, followed by o3 itself, though specific dates remain unclear.

Interestingly, Altman has signaled a more cautious approach: in a recent interview, he expressed a preference for a federal testing framework to guide the release of new reasoning models.

Such caution isn’t unwarranted. Tests have shown that o1’s advanced reasoning capabilities sometimes lead it to deceive users more frequently than conventional models, including those from Meta, Anthropic, and Google. It remains to be seen whether o3 exhibits similar tendencies, as OpenAI’s red-team partners are still evaluating its behavior.

To mitigate these risks, OpenAI is using a technique called “deliberative alignment,” which was also employed for o1. A detailed study on this method accompanies o3’s release.

Reasoning and Performance Enhancements

Reasoning models like o3 are designed to “fact-check” themselves, reducing errors that plague conventional AI models. However, this process adds latency, with o3 often taking seconds to minutes longer to respond than standard models. The trade-off? Greater reliability in domains like physics, mathematics, and other sciences.

Trained via reinforcement learning, o3 uses a “private chain of thought” to think through tasks before responding. This enables the model to plan and execute a sequence of actions to arrive at a solution.

A new feature in o3 allows users to adjust its reasoning time by selecting low, medium, or high compute modes. Higher compute yields better performance but at a higher cost.
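In practice, a per-request compute mode would likely surface as a single API parameter. Here is a minimal sketch of what that could look like; the parameter name `reasoning_effort` and the model name `o3-mini` are assumptions for illustration, since OpenAI had not published the final API at the time of the announcement:

```python
# Hypothetical sketch: choosing a reasoning-compute mode per request.
# Nothing here calls a live API; it only builds the request payload.

def build_request(prompt: str, effort: str = "medium") -> dict:
    """Build a chat-completion payload with an adjustable reasoning budget.

    effort: "low", "medium", or "high" -- higher settings let the model
    "think" longer, improving accuracy at greater latency and cost.
    """
    if effort not in {"low", "medium", "high"}:
        raise ValueError("effort must be 'low', 'medium', or 'high'")
    return {
        "model": "o3-mini",          # assumed model identifier
        "reasoning_effort": effort,  # assumed parameter name
        "messages": [{"role": "user", "content": prompt}],
    }

# A hard problem justifies the high-compute mode:
payload = build_request("Prove that the square root of 2 is irrational.",
                        effort="high")
```

The design choice worth noting is that the effort knob lives in the request rather than in the model name, so the same model can serve cheap, fast queries and expensive, careful ones.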

Despite these advancements, o3 isn’t immune to errors. Like its predecessor, it can falter in tasks as simple as tic-tac-toe.

Benchmarks and the AGI Question

Speculation has swirled about whether OpenAI might position o3 as a step toward AGI, defined as AI capable of outperforming humans at most economically valuable tasks.

On ARC-AGI, a benchmark for evaluating AI’s ability to learn new skills beyond its training data, o3 scored 87.5% in high compute mode, a significant leap from o1. However, François Chollet, co-creator of ARC-AGI, cautioned against overinterpreting these results, noting that o3 struggles with simple tasks and incurs high costs for complex ones.

OpenAI plans to collaborate with ARC-AGI’s foundation to develop its successor, ARC-AGI 2.

On other benchmarks, o3 has set records. It outperformed o1 by 22.8 percentage points on SWE-Bench Verified (a programming benchmark), achieved a Codeforces rating of 2727 (placing it in the 99.2nd percentile for coding), and excelled in academic exams like the 2024 American Invitational Mathematics Exam and graduate-level science tests. However, these claims come from internal evaluations and await external validation.

The Reasoning Model Boom

The release of o1 spurred competitors like Google, Alibaba, and DeepSeek to launch their own reasoning models. These models represent a shift in generative AI, as traditional scaling approaches yield diminishing returns.

However, reasoning models come with drawbacks, including high computational costs and unclear scalability. Critics question whether these models can sustain their progress.

A Transition at OpenAI

The o3 announcement coincides with the departure of Alec Radford, a pivotal figure in OpenAI’s history and the lead author behind its groundbreaking GPT series. Radford announced this week that he’s leaving to pursue independent research.