OpenAI just released something that breaks a fundamental rule of modern AI: it's slow. On purpose. Meet o1, the model known internally as "Strawberry" during development, which can take up to 30 seconds to think before responding. And honestly? The results are kind of mind-blowing.
We've gotten so used to instant AI responses that waiting feels weird. But o1 uses that time to work through complex reasoning chains, essentially "thinking out loud" internally before giving you an answer. It's the difference between blurting out the first thing that comes to mind and actually considering your response.
The Math Olympiad Moment
Here's the stat that made everyone sit up: on the 2024 American Invitational Mathematics Examination (AIME), GPT-4o solved an average of 1.8 out of 15 problems (12%). The o1 model? 11.1 problems correct (74%) with a single sample per problem. With consensus voting across 64 samples, that jumped to 83%, and re-ranking 1,000 samples with a learned scoring function pushed it to 93%, a score that places among the top 500 students nationally and above the USA Mathematical Olympiad qualifying cutoff.
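If you're wondering what "consensus voting" actually means, it's just a majority vote over independent samples: ask the model the same question many times and grade whichever final answer shows up most often. Here's a minimal Python sketch of the idea (my illustration, not OpenAI's actual evaluation pipeline):

```python
from collections import Counter

def consensus_answer(samples: list[str]) -> str:
    """Return the most common final answer among independent samples."""
    counts = Counter(s.strip() for s in samples)
    answer, _count = counts.most_common(1)[0]
    return answer

# Toy run: five samples of the model's final answer to one problem.
# With 64 real samples, the majority answer is what gets graded.
print(consensus_answer(["113", "113", "112", "113", "104"]))  # -> "113"
```

The intuition: individual reasoning chains can derail in different ways, but they tend to converge on the same answer when they're right.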
For context, AIME is designed to challenge the brightest high school math students in America. GPT-4o floundered; o1 performs at the level of the country's top competitors on these tests. And it's not just contest math: on GPQA diamond, a benchmark of PhD-level science questions, o1 became the first model to outscore recruited experts with doctorates.
I tried it last week with a coding problem that had stumped me—one of those multi-step algorithmic challenges where you need to consider edge cases and optimize for performance. o1 took about 15 seconds, showed its reasoning process, and produced working code on the first try. GPT-4o had given me something that looked right but failed on corner cases.
Chain of Thought, But Hidden
The technical approach is interesting. OpenAI trained o1 with reinforcement learning to develop long internal chains of thought before responding. The model reasons through different strategies, recognizes its own mistakes, and backtracks to try alternative approaches, all before you see the final answer.
Here's the controversial part: OpenAI hides the actual chain of thought from users. You get a summary of what it considered, but not the raw internal reasoning. They cite AI safety and competitive advantage as reasons. Some developers are (understandably) annoyed about the lack of transparency. Others point out that o1 might generate concerning content during its thinking process that's better left unseen.
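You can see this trade-off directly in the API. Here's a minimal sketch of calling o1-preview with OpenAI's Python SDK as it stood at launch (when the o1 models accepted only user messages, with no system prompt, temperature, or streaming); treat the exact fields as a snapshot in time rather than a stable contract:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# o1 models at launch: user messages only; no system prompt,
# no temperature, no streaming.
resp = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

print(resp.choices[0].message.content)  # the visible final answer
# The chain of thought itself is hidden, but its size shows up
# (and gets billed) in the usage stats:
print(resp.usage.completion_tokens_details.reasoning_tokens)
```

So you never read the raw reasoning, but you can see exactly how many tokens of it you paid for.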
The safety angle is real, though. o1 scored "medium risk" on CBRN (chemical, biological, radiological, nuclear) weapons knowledge. A researcher from Stanford noted the model "outperforms PhD scientists most of the time on answering questions related to bioweapons." That's... not great? OpenAI says they won't release anything scoring higher than medium risk, but they're already bumping up against their own limits.
When Speed Isn't Everything
o1 isn't meant to replace GPT-4o for everything. For casual chat, summaries, or quick questions, GPT-4o is still better and way faster. o1 shines on complex reasoning tasks—physics problems, mathematical proofs, multi-step coding challenges, competitive programming.
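In practice that means routing: send the hard stuff to o1, everything else to GPT-4o. A crude keyword heuristic makes the trade-off concrete (purely illustrative; a real system would use a trained classifier, not string matching):

```python
REASONING_HINTS = ("prove", "derive", "optimize", "edge case", "algorithm", "complexity")

def pick_model(prompt: str) -> str:
    """Toy router: reasoning-heavy prompts go to o1-preview,
    everything else to the faster, cheaper GPT-4o."""
    p = prompt.lower()
    return "o1-preview" if any(hint in p for hint in REASONING_HINTS) else "gpt-4o"

print(pick_model("Summarize this email thread"))              # gpt-4o
print(pick_model("Prove the algorithm runs in O(n log n)"))   # o1-preview
```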
The pricing reflects this. At $15 per million input tokens and $60 per million output tokens, o1 runs 3-4x the price of GPT-4o, and the hidden reasoning tokens are billed as output tokens even though you never see them. There's also a smaller version called o1-mini that's 80% cheaper than o1-preview and optimized specifically for coding tasks.
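To make the premium concrete, here's a back-of-envelope calculation using those prices. The token counts are made up for illustration, but the catch is real: because reasoning tokens bill at the output rate, even a short visible answer can cost real money.

```python
INPUT_PER_M = 15.00   # $ per 1M input tokens (o1-preview)
OUTPUT_PER_M = 60.00  # $ per 1M output tokens

def o1_call_cost(input_toks: int, visible_toks: int, reasoning_toks: int) -> float:
    # Hidden reasoning tokens are billed at the output rate.
    output_toks = visible_toks + reasoning_toks
    return input_toks / 1e6 * INPUT_PER_M + output_toks / 1e6 * OUTPUT_PER_M

# A 2k-token prompt, a 1k-token visible answer, 10k hidden reasoning tokens:
print(f"${o1_call_cost(2_000, 1_000, 10_000):.2f}")  # $0.69
```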
GitHub integrated o1-preview into Copilot the same day it launched. Microsoft added it to Azure AI services. The enterprise adoption is already happening, despite the preview label and limitations.
The Human Preference Test
OpenAI ran evaluations where human experts compared o1-preview and GPT-4o responses across different domains. For reasoning-heavy tasks like data analysis, coding, and math, people overwhelmingly preferred o1-preview. But for natural language tasks and conversational responses, GPT-4o often won.
This makes sense—o1 is optimized for thinking deeply about complex problems, not casual conversation. Someone I know who teaches computer science said their students are already using o1 for homework help on algorithm design, and it's noticeably better at explaining the logical steps than previous models.
The International Olympiad in Informatics (IOI) test is brutal: six challenging algorithmic problems, ten hours, 50 submission attempts per problem. A specialized version of o1 scored 213 points and ranked in the 49th percentile against human contestants. That's not winning, but it's competitive with actual olympiad participants, and when OpenAI relaxed the limit to 10,000 submissions per problem, the model's score climbed above the gold medal threshold.
What This Means for the Industry
The o1 release changes the conversation about AI capabilities. We're not just making models bigger or faster anymore—we're teaching them different types of thinking. Anthropic's Claude 3.5 and Google's Gemini both tout reasoning abilities, but o1 seems to be the first public model that truly slows down to think.
The applications are already obvious. Healthcare researchers are using it to annotate complex cell sequencing data. Physicists are generating the mathematical formulas needed for quantum optics. Software developers are tackling architectural problems that require weighing multiple approaches and trade-offs.
But there's also a philosophical shift happening. For years, AI progress meant faster, more immediate responses. Now we're adding latency on purpose because the quality gain is worth it. That's a different paradigm.
My Honest Assessment
I've been playing with o1 for a couple of weeks (access came before the public release), and it genuinely feels different. There's something almost eerie about watching it "think" for 20-30 seconds before responding. You can practically imagine gears turning, even though you know it's just tokens and transformers.
The hallucination problem isn't solved—OpenAI's own research lead admitted that. But it's noticeably reduced. The model seems more aware of what it doesn't know and more likely to caveat uncertain responses.
Is it worth the price premium and wait time? For complex reasoning tasks, absolutely. For everything else, probably not. Which is fine—different tools for different jobs.
The bigger question is where this leads. OpenAI has already said o3 is coming (they skipped the name o2 to avoid a trademark clash with the British carrier O2). If the pattern holds, we'll see continued improvements in reasoning depth even if it means longer response times.
And honestly? I'm here for it. The instant-response AI race was getting ridiculous. Maybe it's okay if AI takes a beat to actually think things through.