Looking At Groundbreaking Capabilities With OpenAI O3

It’s the end of ‘shipmas’, almost Christmas time, and OpenAI has given us some information about the pending model o3, and how it does its reasoning.

One of the most prominent demos is in this YouTube video with Sam Altman, who is joined by Mark Chen, Hongyu Ren, and special guest Greg Kamradt, to talk about o3; and related models.

“This model is incredible at programming,” Altman says as they look at benchmarks like GPQA Diamond for Ph.D-level science questions; and EpochAI frontier for math, where o3 demonstrates breakout results.

As demonstrated, the model is getting good marks against practical testing of skilled human professionals.

The group also discussed the use of these new models for SWE-bench operations, or in other words, for implementing real-world software tasks.

Some Scientific Notes on Advancement

OpenAI has also published a recent explanation of some of the science in o3 and newer models. It’s called “deliberative alignment” and it has to do with extending chain of thought operations and training models on safety specifications.

“Despite extensive safety training, modern LLMs still comply with malicious prompts, over-refuse benign queries, and fall victim to jailbreak attacks,” spokespersons explain. “One cause of these failures is that models must respond instantly, without being given sufficient time to reason through complex and borderline safety scenarios. Another issue is that LLMs must infer desired behavior indirectly from large sets of labeled examples, rather than directly learning the underlying safety standards in natural language. This forces models to have to reverse engineer the ideal behavior from examples and leads to poor data efficiency and decision boundaries. Deliberative alignment overcomes both of these issues. It is the first approach to directly teach a model the text of its safety specifications and train the model to deliberate over these specifications at inference time. This results in safer responses that are appropriately calibrated to a given context.”

In addition, to show off how this works, OpenAI provides a demo of the computer finding evidence of wrongdoing and failing to comply with a demand.

Deliberative alignment, the researchers claim, will do better than reinforcement learning from human feedback (RLHF) and something called RLAIF.

“Deliberate alignment training uses a combination of process- and outcome-based supervision,” spokespersons write. “We first train an o-style model for helpfulness, without any safety-relevant data. We then build a dataset of (prompt, completion) pairs where the CoTs in the completions reference the specifications. We do this by inserting the relevant safety specification text for each conversation in the system prompt, generating model completions, and then removing the system prompts from the data. We perform incremental supervised fine-tuning (SFT) on this dataset, providing the model with a strong prior for safe reasoning. Through SFT, the model learns both the content of our safety specifications and how to reason over them to generate aligned responses. We then use reinforcement learning (RL) to train the model to use its CoT more effectively. To do so, we employ a reward model with access to our safety policies to provide additional reward signal. In our training procedure, we automatically generate training data from safety specifications and safety-categorized prompts, without requiring human-labeled completions. Deliberative alignment’s synthetic data generation pipeline thus offers a scalable approach to alignment, addressing a major challenge of standard LLM safety training—its heavy dependence on human-labeled data.”

Feedback from Humans

In the above video, Greg Kamradt of ARC AGI goes over how o3 is knocking it out of the park on the proprietary methods that ARC uses to assess logical expertise: a series of pixel-based tests where the machine, or the human, has to figure out a pattern.

“When we actually ramp up to high compute, o3 was able to score 85.7% on the … holdout set,” he said. “This is especially important because human performance is comparable at 85% threshold. So being above this is a major milestone, and we have never tested a system that has done this, or any model that has done this beforehand. So this is new territory in the ARC AGI world.”

Many others are also talking about how the model represents a landmark in the quick march toward AGI and even the singularity.

“The introduction of the o3 models highlights the untapped possibilities of AI reasoning capabilities,” writes Amanda Caswell at Tom’s Guide. “From enhancing software development workflows to solving complex scientific problems, o3 has the potential to reshape industries and redefine human-AI collaboration.”

That’s only part of what people are saying about this model! I’m seeing charts flying around showing exponential leaps toward AGI, and asking when we will announce that we have achieved this benchmark as a society.

So let’s keep an eye on what these models are doing as 2024 winds down.