What does OpenAI's o3 mean for agents?

Plus: Anthropic's cookbook for effective agents, hundreds of enterprise agent case studies from Google, and more


Welcome back to Building AI Agents, your biweekly guide to everything new in the AI agent field!

If you’re a time traveler from the ’90s looking to bring agents to your company, I found the perfect guide for you—just remember to make them Y2K-proof! Oh, and invest in a little startup called Google. Trust me.

Due to the Christmas holiday, there will be no edition of Building AI Agents this Thursday. Issues will resume as normal beginning Monday, December 30. Happy Holidays!

In today’s issue…

  • o3 stuns the AI community—will it lead to better agents?

  • Anthropic’s cookbook for effective agents

  • 321 enterprise agent case studies from Google

  • How the agent wave differs from traditional SaaS

…and more

🔍 SPOTLIGHT

o3 vs o1 on ARC-AGI | Source: arcprize.org

OpenAI just took a significant step towards artificial general intelligence (AGI), at least if a benchmark of that name is to be believed.

On Friday, the company unveiled o3, a reasoning model in the same mold as its earlier o1, and quickly set the AI world aflame with debate over its implications. Like o1, o3 “thinks” by iteratively producing a hidden chain of reasoning, in which the model works step by step toward a conclusion that is then presented to the user. o3 substantially improved on o1’s performance across a variety of benchmarks, from coding evaluations to mathematical reasoning to graduate-level science problems. Highlights include a score of 96.7% on the 2024 American Invitational Mathematics Examination and 25.2% on Epoch AI’s FrontierMath benchmark—on which no other model exceeds 2%.

But it was o3’s performance on François Chollet’s infamously difficult Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) that raised the most eyebrows. ARC-AGI, as its name suggests, is intended to spur the development of artificial general intelligence by being trivially easy for humans but highly challenging for AI—a goal it has so far achieved. Human testers average over 75%—with STEM graduates exceeding 95%—while even o1 scores only 32% at its most expensive reasoning setting. o3, however, rocketed to human-level performance with an astonishing 88%. Skeptics noted that this result was achieved only by allowing the model a huge number of reasoning tokens, racking up a bill in the thousands of dollars. However, with model costs dropping by orders of magnitude per year, capabilities that cost thousands of dollars today may well cost pocket change in a surprisingly short time.

As a recent edition of Building AI Agents noted, o1, though impressive in many respects, was a questionable choice of model for agent builders. It lacked critical features such as function calling, a system prompt, and structured outputs, and its performance did not significantly exceed that of other models, particularly once cost was accounted for. Some of these drawbacks have just been addressed: as part of its rollout of the full version of o1, OpenAI added support for all three of those features. Even so, the incentive to switch from other LLMs remained slim, with similar capabilities achievable at lower cost from more traditional models or newer Chinese imitators. o3 promises to turn this dynamic on its head, offering levels of performance unattainable anywhere else at any price. It remains to be seen how well o3 will hold up as an engine for agentic reasoning, but according to OpenAI CEO Sam Altman, we may soon find out: the company plans to release a mini version of the model in January, with the full version to follow soon after.
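For agent builders curious what those newly added features look like in practice, here is a minimal sketch using the official openai Python SDK. The developer message, tool definition, and weather example below are illustrative assumptions on our part, not code from OpenAI’s announcement.

```python
# Minimal sketch: calling o1 with a developer message (the reasoning-model
# counterpart of a system prompt) and a function-calling tool. The tool
# here is invented for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1",
    messages=[
        {"role": "developer", "content": "You are a careful research agent."},
        {"role": "user", "content": "What is the weather in Paris right now?"},
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    # Structured outputs would be requested the same way, by also passing a
    # response_format={"type": "json_schema", ...} argument.
)

message = response.choices[0].message
if message.tool_calls:  # the model chose to call the tool
    print("Tool requested:", message.tool_calls[0].function.name)
else:
    print(message.content)
```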

The question of whether o3 really deserves the august title of AGI will undoubtedly remain the subject of considerable debate, hinging as much on philosophy as on computer science. Regardless, if the results released by OpenAI are to be believed, the model represents a considerable step forward in AI technology, and one whose public release agent builders should eagerly await.

If you find Building AI Agents valuable, forward this email to a friend or colleague!

🤝 WITH AI TOOL REPORT

There’s a reason 400,000 professionals read this daily.

Join The AI Report, trusted by 400,000+ professionals at Google, Microsoft, and OpenAI. Get daily insights, tools, and strategies to master practical AI skills that drive results.

📰 NEWS

The Thinker by Rodin | Source: Encyclopedia Britannica

Google announced Gemini 2.0 Flash Thinking Experimental, a reasoning model similar to o1 and o3, as part of its broader pivot towards agentic AI with Gemini 2.0.

A new update to AG2, the third-party fork of Microsoft’s open-source agent framework AutoGen created by much of the original team, adds the ability to build real-time voice agents, enabling new interfaces such as phone calls.

Microsoft’s new feature for its Copilot software is the latest entrant in the growing competition over visual agents, a race kicked off by Anthropic’s Computer Use in October.

🛠️ USEFUL STUFF

Source: Wikipedia

Anthropic shares lessons from its experience building agents, including when (or whether) to use agent frameworks, common agent workflow patterns, and a cookbook of reference implementations.
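To make one of those patterns concrete, here is a rough sketch of prompt chaining with a self-check gate, in the spirit of (but not copied from) Anthropic’s examples; the llm helper is a placeholder you would wire to any chat-completion API.

```python
# Rough sketch of the prompt-chaining pattern: each LLM call consumes the
# previous call's output, with a simple gate between steps. The `llm`
# helper is a placeholder, not code from Anthropic's cookbook.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider of choice")

def summarize_with_chain(document: str) -> str:
    outline = llm(f"Outline the key claims in this document:\n{document}")
    draft = llm(f"Write a one-paragraph summary from this outline:\n{outline}")
    # Gate: ask the model to check its own work before returning.
    verdict = llm(
        "Does this summary faithfully match the outline? Answer YES or NO.\n"
        f"Outline:\n{outline}\nSummary:\n{draft}"
    )
    if not verdict.strip().upper().startswith("YES"):
        draft = llm(
            "Revise this summary so that it matches the outline.\n"
            f"Outline:\n{outline}\nSummary:\n{draft}"
        )
    return draft
```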

A massive roundup of generative AI systems built by companies using Google’s AI technology across a wide range of industries.

A tutorial by Amazon Web Services on using its Amazon Bedrock Agents service in conjunction with CrewAI to create a multi-agent system in the AWS cloud.
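For a taste of what the CrewAI side of such a setup involves, here is a minimal two-agent crew; this is a sketch rather than the tutorial’s own code, and it assumes model credentials are already configured for CrewAI’s default LLM settings.

```python
# Minimal CrewAI sketch: two agents, two sequential tasks. Roles and task
# text are invented for illustration; this is not the AWS tutorial's code.
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Gather accurate facts about a topic",
    backstory="A meticulous analyst who double-checks sources.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short brief",
    backstory="A concise technical writer.",
)

research_task = Task(
    description="List five key facts about multi-agent systems.",
    expected_output="A bulleted list of five facts.",
    agent=researcher,
)
writing_task = Task(
    description="Write a three-sentence brief from the research notes.",
    expected_output="A three-sentence brief.",
    agent=writer,
)

crew = Crew(agents=[researcher, writer], tasks=[research_task, writing_task])
print(crew.kickoff())  # runs the tasks in order and returns the final output
```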

A quick primer on agentic RAG and how it achieves the superior performance that is leading it to displace traditional RAG.
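The core idea is easy to see in code: rather than retrieving once and then answering, an agentic-RAG system lets the model decide whether, and what, to retrieve. Below is a rough sketch assuming generic llm and search helpers that you would wire to your own model and vector store.

```python
# Sketch of an agentic-RAG loop: the model chooses between searching for
# more evidence and answering. `llm` and `search` are placeholder stubs.
def llm(prompt: str) -> str:
    raise NotImplementedError("call your chat model here")

def search(query: str) -> list[str]:
    raise NotImplementedError("query your vector store here")

def agentic_rag(question: str, max_steps: int = 3) -> str:
    context: list[str] = []
    for _ in range(max_steps):
        decision = llm(
            f"Question: {question}\n"
            "Context so far:\n" + "\n".join(context) + "\n"
            "Reply 'SEARCH: <query>' if you need more evidence, "
            "or 'ANSWER: <answer>' if you can answer now."
        )
        if decision.startswith("SEARCH:"):
            context.extend(search(decision.removeprefix("SEARCH:").strip()))
        else:
            return decision.removeprefix("ANSWER:").strip()
    # Fall back to answering with whatever evidence was gathered.
    return llm("Answer the question using this context only.\n"
               "Context:\n" + "\n".join(context) + f"\nQuestion: {question}")
```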

💡 ANALYSIS

Source: Wikimedia Commons

This article compares the arrival of AI agents to the SaaS wave of the early 2000s, focusing on the differences in how each paradigm meets the needs of business customers.

Not to be confused with the company’s recent State of AI Agents report, this one takes a broader view of the AI field but still gives agents some attention, finding that they represent a rapidly growing share of LLM-powered systems.

Sayan Chakraborty of SaaS powerhouse Workday discusses the power LLMs are bringing to companies, comparing them to the steam engine, the general-purpose workhorse of the Industrial Revolution.

Some metrics and observations from the agent framework startup, drawn from survey responses of more than 4,500 builders using CrewAI’s tech.

Drawing on Marc Benioff’s observation that agents represent a step change from earlier forms of AI, this article lays out the differences between agents and the simpler chatbots and other genAI applications that preceded them, as well as the conditions that will need to be met to fully realize agentic AI’s potential.

🧪 RESEARCH

TheAgentCompany’s workflow | Source: arXiv

TheAgentCompany is a new benchmark designed to assess the performance of AI agents on the kinds of tasks a real digital worker might perform, such as browsing the web, writing code, running programs, and communicating with coworkers.

The authors of this paper (fittingly) propose and evaluate a new framework called Proposer-Agent-Evaluator for using reinforcement learning (RL) to train web-browsing agents. A proposer suggests tasks for the agent to attempt in order to hone its skills, and an evaluator grades the results, providing a reward signal through which the agent can be improved with RL.
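Reduced to pseudocode, that training loop looks roughly like the following; the object names and methods are placeholders standing in for the paper’s components, not its actual implementation.

```python
# Schematic of the Proposer-Agent-Evaluator (PAE) loop: the proposer
# invents tasks, the agent attempts them, and the evaluator's grade
# becomes the RL reward. All names here are illustrative placeholders.
def train_pae(proposer, agent, evaluator, rl_update, num_rounds: int) -> None:
    for _ in range(num_rounds):
        task = proposer.propose()                   # suggest a web task
        trajectory = agent.attempt(task)            # agent tries the task
        reward = evaluator.grade(task, trajectory)  # grade the outcome
        rl_update(agent, trajectory, reward)        # policy-update step
```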

A white paper by OpenAI defining agentic AI systems and the set of best practices judged necessary to safely operate them in the real world.

Thanks for reading! Until next time, keep learning and building!

What did you think of today's issue?


If you have any specific feedback, just reply to this email—we’d love to hear from you.

Follow us on X (Twitter), LinkedIn, and Instagram