New Model Reasoning: An Engineer’s Take

The models are coming out fast and furious – it seems like every time we turn around, there are new kinds of LLM operations and AI engines to make sense of.

But what do these changes actually do in the industry?

I came across an X post, ostensibly from Dr. Tim Scarfe of Machine Learning Street Talk, in which someone with evident hands-on experience of these technologies explains why the o1-pro model is a breakthrough.

Essentially, Scarfe says, the new model changes the iterative process through which engineers prompt LLMs to perform complex tasks.

“The biggest apparent change with o1-pro is the complexity it can handle in a ‘single shot,’” he writes. “Previously, LLMs could only do ‘so much work’ in a single forward pass, and there were weird restrictions we had to subconsciously internalize due to the self-attention linearization hacks, i.e. you could only ever ask LLMs to address and do work inside an amorphous limited subspace of the context.”

He also points out that what looks like a ‘single shot’ isn’t actually one: a parallelized search tree process is in play behind the scenes.
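To make the idea of searching in parallel at inference time a little more concrete, here is a minimal Python sketch. It is not how o1-pro works internally, which the post does not spell out. It uses best-of-N sampling, a much simpler technique than a search tree, and both `toy_generate` and `toy_score` are hypothetical stand-ins for a real model call and a real answer checker.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def toy_generate(prompt: str, seed: int) -> int:
    """Hypothetical stand-in for one sampled model answer (a number from 0 to 100)."""
    rng = random.Random(seed)
    return rng.randint(0, 100)


def toy_score(candidate: int, target: int = 42) -> int:
    """Hypothetical checker: answers closer to the target score higher."""
    return -abs(candidate - target)


def best_of_n(
    prompt: str,
    n: int,
    generate: Callable[[str, int], int],
    score: Callable[[int], int],
) -> int:
    """Sample n candidates in parallel and keep the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=min(n, 8)) as pool:
        candidates = list(pool.map(lambda seed: generate(prompt, seed), range(n)))
    return max(candidates, key=score)


if __name__ == "__main__":
    single = toy_generate("guess the number", 0)
    best = best_of_n("guess the number", 16, toy_generate, toy_score)
    print("one sample:", single, "| best of 16:", best)
```

From the outside, the caller still sees one request and one answer, even though many candidates were generated and compared along the way.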

A Postage Stamp of Attention

In addition, Scarfe uses a postage stamp analogy to describe the constrained capability of last-generation attention mechanisms.

“Imagine you had a world map,” he writes, “and in every forward pass of an LLM you could only perform a ‘postage stamps worth’ of computation, and you decided as a prompter where to place the postage stamp on the map. That’s pretty much how LLMs worked before o-series. So we as engineers designed ways to place more postage stamps, or subdivide the map and aggregate the results into something coherent.”

He explains how engineering teams tried to get around these limitations with multi-agent collaboration and other techniques.

“o1-pro now automates this for us with less need for prompt hacking (and/or) engineering from us,” he adds.
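The manual workaround Scarfe describes, subdividing the map and aggregating the results, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a real LLM call, and the 50-word chunk size is an arbitrary choice, not something taken from his post.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Subdivide the 'map' into pieces small enough for one pass."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_then_aggregate(
    text: str,
    question: str,
    ask_model: Callable[[str], str],
    max_words: int = 50,
) -> str:
    """Ask the question once per chunk, then once more to merge the partial answers."""
    chunks = split_into_chunks(text, max_words)
    # One 'postage stamp' per chunk.
    partials = [ask_model(f"{question}\n\n{chunk}") for chunk in chunks]
    combined = "\n".join(partials)
    # A final pass aggregates the partial answers into one result.
    return ask_model(f"Combine these partial answers to: {question}\n\n{combined}")


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; it only reports the prompt size."""
    return f"[answer based on {len(prompt.split())} words]"


if __name__ == "__main__":
    document = "word " * 120
    print(map_then_aggregate(document, "What is this about?", toy_model))
```

Every extra chunk means one more call that the engineer has to orchestrate by hand, which is the overhead Scarfe says the new model takes on itself.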

He also refers to transformers as “finite state automata,” saying they’re extremely limited, again, in the types of computation they can do in a single forward pass.

Notwithstanding the semantics of automata, that makes sense. (For a stricter reading, here is what ChatGPT has to say: “(Transformers are) a continuous and parameterized computational framework and thus are outside the classic, discrete automata model.”)

There’s a certain subjectivity there; I just thought that was interesting. Anyway, those who are discovering these model capabilities (and using them) are helping the AI systems to organize their resources in different ways to become more capable and more versatile.

What’s the Difference?

Scarfe also describes the difference that the new model makes to users this way: “more verbosity, more diversity and less banality.”

And, at the end of the day, more accuracy.

Let’s look at these criteria in a bit more detail.

Verbosity has to do with the ways that the models speak to us and answer our questions. You can frame it this way: is the LLM a Shakespeare or a kindergartner? As for diversity, when the model can search better at inference, it can deliver wider-ranging results. And banality – well, that has a little bit to do with the uncanny valley. I’ve written about how early LLM results were “simple,” “generic,” in a word, yes, “banal.” In other words, it’s the nuance and complexity of the result that passes a deeper Turing test.

And in terms of accuracy:

“(The new model is) now spreading out 1000 postage stamps on the map, capturing exactly the information which matches and answers my prompt,” Scarfe writes. “The difference is night and day.”

Deep Thoughts from Francois Chollet

At the end of the post, Scarfe references Francois Chollet, a renowned voice in AI research who left Google to work on the ARC Prize. I’ve covered his work in prior posts: the benchmark asks an AI engine to solve pattern recognition problems that humans can do without too much trouble.

Navigating over to Chollet’s own X feed, you can see that he is optimistic about the progress recent models have made on the ARC problem.

“Today OpenAI announced o3, its next-gen reasoning model,” Chollet wrote Dec. 20. “We’ve worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks. It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task in compute) and 87.5% in high-compute mode (thousands of $ per task). It’s very expensive, but it’s not just brute force – these capabilities are new territory and they demand serious scientific attention.”

Here are some other interesting statements that Chollet has made lately about the state of the AI industry.

“Computing used to feel fast — everything ran locally, software was mostly in C/C++ and was kept in check by the need to run on all kinds of old hardware. Now any one of my Chrome tabs is using 100x more RAM than a NeXT workstation had in total.” – Sept. 3, 2024

“The current climate in AI has so many parallels to 2021 web3 it’s making me uncomfortable. Narratives based on zero data are accepted as self-evident. Everyone is expecting as a sure thing ‘civilization-altering’ impact … in the next 2-3 years.” – Jan. 8, 2023

And here’s one with great relevance to the markets:

“Software is this weird space where you can spend basically nothing and create a billion dollars of value, or spend a billion dollars and create basically no value.” – Feb. 1, 2022

In Conclusion

That’s some of what I find relevant to today’s engineering world as we discover new model capabilities. I say discover, not build, because the systems themselves are endowed with capabilities that amaze humans. Watch this space for more on what new models will do in the future.