OpenAI o1 falls short
Plus: how to build a general agent in minutes, best practices for architecting all your agent systems, and more

Welcome back to Building AI Agents, your biweekly guide to everything new in the AI agent field!
If you thought that coding the future isn’t exciting enough and want it to be a little more like TikTok, someone’s got the IDE for you
Thrilled to announce my new code editor startup: ZoomCode
It’s a new code editor made specifically for the cracked adhd Zoomer programmers, including a subway surfer / family guy clip sidebar to keep their attention from wandering
Can’t wait to present to @ycombinator
— Kristof (@ThuleanFuturist)
4:59 PM • Dec 5, 2024
In today’s issue…
o1 disappoints as an agentic model
DeepMind trains agents in simulated worlds
Build a general agent for all your tasks
Best practices for architecting agent systems
…and more
🔍 SPOTLIGHT

Source: OpenAI
If you ask OpenAI’s o1 reasoning model to think about whether it’s a good option for agentic systems, you may induce some cognitive dissonance.
As part of its 12 Days of OpenAI event, the company unveiled the much-anticipated full version of its step-by-step reasoning o1 LLM. Originally released as a preview in September, o1 drew mixed reactions: some were wowed by its capabilities, while others felt it was merely an incremental improvement over prior models—an attempt to keep LLM progress moving as the field broadly slows.
As a purpose-built multi-step reasoning model, o1 reproduces the commonly used chain-of-thought prompting method, generating variable numbers of hidden “reasoning tokens” as it thinks through its answer. In this way, o1 can be considered a sort of simple agent, iteratively manipulating its own outputs rather than simply producing a one-shot response.
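For readers who want to approximate that behavior with a conventional chat model, here’s a minimal sketch of chain-of-thought-style prompting. The prompt wording and tag names are illustrative assumptions, not OpenAI’s actual mechanism—o1’s reasoning tokens are hidden and baked in via training—but any instruction-following LLM can be steered this way:

```python
# Sketch of chain-of-thought-style prompting, the technique o1 bakes into
# its training. The instruction text below is hypothetical; swap in any
# instruction-following chat model and adjust the wording to taste.

COT_INSTRUCTIONS = (
    "Think through the problem step by step inside <thinking> tags, "
    "then give only your final answer inside <answer> tags."
)

def build_cot_messages(question: str) -> list[dict]:
    """Assemble a chat-completion message list that elicits visible
    intermediate reasoning, loosely analogous to o1's hidden reasoning
    tokens (which, unlike this sketch, are not shown to the user)."""
    return [
        {"role": "system", "content": COT_INSTRUCTIONS},
        {"role": "user", "content": question},
    ]

messages = build_cot_messages("How many R's are in 'strawberry'?")
```

The key difference from o1 is control: with an explicit prompt like this you can shape the reasoning format yourself, which, as noted below, o1 actively discourages.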
However, the attributes that most make a system truly agentic are absent from o1. Currently, it does not support tool calling or structured outputs, two abilities that allow agents to invoke external resources to accomplish their goals—though OpenAI claims to be working on adding these features. The option to include a system message that instructs the agent on its goals is similarly absent. Furthermore, o1 can behave strangely when integrated into custom agentic architectures, since it was trained to follow a specific reasoning process—indeed, OpenAI actively discourages prompts that ask the model to reason in any particular way. While o1 may be better than its predecessors at toy problems like counting the R’s in “strawberry”, its value as a drop-in replacement for existing LLMs in compound AI systems is currently limited. Even the newly released full version fails to substantially outperform Claude 3.5 Sonnet and OpenAI’s own GPT-4o on several benchmarks of agentic tasks, in some cases falling short of the latter.
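To make the missing piece concrete, tool calling works by having the model emit a structured request—tool name plus JSON arguments—that the agent loop dispatches to real code. A hedged sketch of those mechanics, using a made-up `get_weather` tool and no live API call (the schema style mirrors the JSON Schema format most chat APIs accept, but the specifics here are assumptions):

```python
import json

# Illustrative tool-calling mechanics. A model that supports tool calling
# (e.g. GPT-4o or Claude 3.5 Sonnet) returns a structured call like
# {"name": ..., "arguments": ...}; the agent dispatches it to real code.
# `get_weather` and the hardcoded call below are made-up stand-ins.

WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    # Stand-in for a real weather API lookup.
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to the matching Python function."""
    fn = TOOLS[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return fn(**args)

# Simulating what a tool-capable model would emit; o1 currently cannot
# produce such calls, which is the gap described above.
result = dispatch({"name": "get_weather", "arguments": '{"city": "Paris"}'})
```

This request-and-dispatch loop is precisely what lets an agent act on the outside world—and why its absence keeps o1 out of most production agent stacks for now.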
Consequently, it is not surprising that o1 has seen relatively little usage among agent builders. Significantly cheaper alternatives—the aforementioned models and Meta’s open-source Llama series—are available, all of which, unlike o1, can interact with the outside world via tool calls; GPT-4o’s API prices are one-sixth those of o1-preview. Users can reproduce much of o1’s chain-of-thought reasoning with their own custom instructions, with the added benefit of greater customization. And if native multi-step thinking is truly essential, more affordable open-source Chinese alternatives are increasingly coming onto the scene.
Future versions of o1 armed with a more complete suite of features and fine-tuned to operate as part of larger agentic systems may someday re-assert OpenAI’s dominance in the agent space, but for now, other options are more…reasonable.
If you find Building AI Agents valuable, forward this email to a friend or colleague!
Refer one person and we’ll send you a set of weekly bonus links to additional AI agent news for a month! See here for a sample
🤝 WITH AI TOOL REPORT
There’s a reason 400,000 professionals read this daily.
Join The AI Report, trusted by 400,000+ professionals at Google, Microsoft, and OpenAI. Get daily insights, tools, and strategies to master practical AI skills that drive results.
📰 NEWS
Source: Google
Google DeepMind introduced Genie 2, a foundation model designed to generate playable 3D environments for embodied agents. This technique could potentially allow agents to be trained to superhuman performance on real-world tasks, just as DeepMind achieved superhuman performance on games by training its reinforcement learning agents on synthetic data.
With AI agents and prediction markets both seeing surges of interest recently, a team of cryptocurrency traders has launched Polytrader, an agent designed to analyze news and trends to identify and trade mispriced contracts.
🛠️ USEFUL STUFF

Source: Medium
An easy walkthrough on building a starter agent that can be turned to a wide variety of tasks.
A list of all the papers and projects cited in Large Language Model-Brained GUI Agents: A Survey, providing a comprehensive roundup of published research in the visual agent space.
Coval is a YC-backed startup whose technology allows users to monitor their voice and chat agents by running them through thousands of simulations to ensure that they handle all edge cases correctly.
💡 ANALYSIS

Garry Tan | Source: Y Combinator
The Y Combinator President and CEO provides a primer on how to use Anthropic’s new GUI agent, as well as why the technology is potentially revolutionary.
The author of this article walks chief operating officers (COOs)—and any other business leader looking to integrate agentic AI—through the essential considerations they must weigh.
This piece discusses the inroads that AI agents are making into scientific research, emphasizes the need for robust safeguards—and addresses the deeper philosophical question of what human scientists’ role will be in a future of automated research.
An exposition on the key part cryptocurrencies could play in facilitating agent financial transactions, including the drawbacks of traditional payment systems and how blockchain technology can address them.
🧪 RESEARCH

Source: Pix4free
This paper reviews best practices for building agent systems, grouping them into Planning, Memory, Tools, and Control Flow, and gives practical advice for overcoming difficulties in each of these domains.
This paper identifies 12 key challenges that occur as humans exchange information with AI agents, and provides guidance on how to build agentic systems that are robust to these potential pitfalls.
Self-reasoning—the ability to introspect and analyze one’s own behavior—is a crucial skill for AI agents, but existing evaluations only assess it in non-agentic settings. The authors of this paper introduce MISR, a benchmark for frontier models’ ability to self-reason in agentic settings.
Thanks for reading! Until next time, keep learning and building!
What did you think of today's issue?
If you have any specific feedback, just reply to this email—we’d love to hear from you
Follow us on X (Twitter), LinkedIn, and Instagram