AI Agents That Matter
Better benchmarks for evaluating AI agents, fears of rogue agents wreaking havoc across the internet, the impact agents are having on software development, and more

🔍 Spotlight
Good evaluations are as important as good engineering when designing AI systems, but many of the existing benchmarks for AI agent performance are riddled with deficiencies. In their new paper AI Agents That Matter, a team of researchers at Princeton tackles this issue head-on, calling out a variety of drawbacks in popular agent benchmarks and proposing actionable solutions.
Just as benchmarks such as MMLU, GPQA, and GSM8K have become ubiquitous for evaluating large language models (LLMs), a variety of similar benchmarks have been proposed to assess the performance of AI agents, the complex, multi-step reasoning systems powered by these LLMs. These include WebArena, which challenges agents to solve web-browsing tasks; SWE-Bench and HumanEval, which assess performance on code-generation tasks; HotpotQA, which measures agents' ability to answer questions that require multi-hop reasoning; and many more.
However, these benchmarks come with a host of flaws, which the paper spells out in detail. They typically measure an agent's ability to answer a question or write a piece of code correctly without accounting for the cost of the LLM API calls or computation happening behind the scenes, which lets agent developers game the tests by expending massive amounts of compute to ensure accuracy on relatively simple problems. Furthermore, many benchmarks contain so few samples that agents can overfit to them during fine-tuning, and benchmark developers often fail to designate a holdout set. Some agent builders cheat using shortcuts, such as hard-coding their systems to give responses tailored to the specific challenges in the assessment. Finally, many such benchmarks do not account for the degree of human interaction with the agent, meaning that much of an agent's measured performance may be due to human assistance.
To build better benchmarks, the paper's authors propose that effective agent evaluations treat cost and performance as a two-dimensional frontier, where each can be traded off against the other. A designated test set should be held out, with problems whose generality matches that of the agent capabilities being tested. Human-in-the-loop operation should be planned for explicitly, with agents evaluated both with and without human assistance, and human intervention otherwise discouraged.
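To make the cost-performance frontier concrete, here is a minimal sketch (not code from the paper) that takes each agent configuration's average dollar cost per task and its accuracy, and keeps only the Pareto-optimal configurations, i.e. those for which no other configuration is both cheaper and more accurate. The agent names and numbers are invented for illustration.

```python
# Minimal sketch of a cost-accuracy Pareto frontier for agent evaluation.
# The configurations and numbers below are illustrative, not results from the paper.

def pareto_frontier(results):
    """Return configurations for which no other is both cheaper and more accurate.

    `results` maps agent name -> (avg_cost_per_task_usd, accuracy).
    """
    frontier = []
    for name, (cost, acc) in results.items():
        dominated = any(
            other_cost <= cost and other_acc >= acc and (other_cost, other_acc) != (cost, acc)
            for other_name, (other_cost, other_acc) in results.items()
            if other_name != name
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda t: t[1])  # order by cost

results = {
    "gpt-4o-mini_single_call": (0.002, 0.48),
    "gpt-4o_single_call": (0.02, 0.61),
    "gpt-4o_self_consistency_3x": (0.06, 0.60),  # dominated: costs more, scores lower
    "gpt-4o_reflexion_5x": (0.11, 0.64),
}
for name, cost, acc in pareto_frontier(results):
    print(f"{name}: ${cost:.3f}/task, accuracy {acc:.0%}")
```

Plotting accuracy against cost for the surviving configurations gives the kind of frontier the authors argue benchmarks should report, rather than a single accuracy number.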
By incorporating these changes, future benchmarks can avoid the shortcomings which plague many existing ones and provide more robust evaluations for the wave of agentic systems being built today.
📰 News
The founders of LangChain and LlamaIndex discussed the past, present, and future of AI agents at the AI Engineer World’s Fair, emphasizing the importance of moving beyond the initial hype of 2023 prototypes such as AutoGPT to a more pragmatic understanding of their capabilities.
🧪 Research
Function-calling is a critical skill for AI agents, but many models underperform in this domain. This paper by Salesforce researchers introduces APIGen, a pipeline to synthesize datasets for function-calling models. The researchers found that LLMs trained on this data can outperform ones orders of magnitude larger on function-calling tasks.
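For readers unfamiliar with the format, here is a rough sketch of what a synthesized function-calling training sample can look like: a natural-language query, the tool schemas available to the model, and the ground-truth call it should emit. The field names below are assumptions for illustration, not APIGen's exact schema.

```python
# Illustrative shape of a synthesized function-calling training sample.
# Field names are assumptions for this sketch, not APIGen's exact schema.
import json

sample = {
    "query": "What's the weather in Berlin tomorrow in Celsius?",
    "tools": [
        {
            "name": "get_weather_forecast",
            "description": "Get the forecast for a city on a given date.",
            "parameters": {
                "city": {"type": "string"},
                "date": {"type": "string", "description": "ISO 8601 date"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
        }
    ],
    # Ground-truth call(s) the model should emit, used as the training target.
    "answers": [
        {
            "name": "get_weather_forecast",
            "arguments": {"city": "Berlin", "date": "2024-07-10", "unit": "celsius"},
        }
    ],
}

print(json.dumps(sample, indent=2))
```

In practice, a synthesis pipeline would also verify generated samples (for example, by checking formats and executing the calls) before adding them to the training set.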
AI agents are a promising avenue for forecasting international events, but there are few objective evaluations available to assess their performance in this domain. MIRAI is a new benchmark for assessing agents as forecasters on tasks spanning a wide range of time horizons and events.
Emerging software engineering agents are complex and often expensive to run. This paper describes an approach it characterizes as agentless, in which software bugs are found and fixed via a simple two-step process of localization and repair.
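As a rough sketch of that localize-then-repair pattern (not the paper's actual implementation), the snippet below first asks a model to rank the files most likely to contain the bug, then asks it for a patch limited to those files. Here `llm` is any prompt-to-completion callable you supply; it is a stand-in, not an interface from the paper's codebase.

```python
# Sketch of a two-step localize-then-repair flow in the spirit of an "agentless"
# pipeline. `llm` maps a prompt string to a completion string (e.g. a thin
# wrapper around your model client).
from typing import Callable

def localize(llm: Callable[[str], str], issue: str, files: dict[str, str], top_k: int = 3) -> list[str]:
    """Step 1: ask the model which files most likely contain the bug."""
    listing = "\n".join(f"- {path}" for path in files)
    prompt = (
        f"Issue:\n{issue}\n\nRepository files:\n{listing}\n\n"
        f"List the {top_k} files most likely to need changes, one path per line."
    )
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()][:top_k]

def repair(llm: Callable[[str], str], issue: str, files: dict[str, str], candidates: list[str]) -> str:
    """Step 2: ask the model for a patch limited to the localized files."""
    context = "\n\n".join(f"### {p}\n{files[p]}" for p in candidates if p in files)
    prompt = (
        f"Issue:\n{issue}\n\nRelevant files:\n{context}\n\n"
        "Return a unified diff that fixes the issue, touching only these files."
    )
    return llm(prompt)

def fix_issue(llm: Callable[[str], str], issue: str, files: dict[str, str]) -> str:
    return repair(llm, issue, files, localize(llm, issue, files))
```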
GPT-RadPlan is a novel framework for planning radiotherapy treatments via a multi-agent system powered by GPT-4Vision, without any further fine-tuning. The authors find that the treatment plans it proposes match or outperform human-generated ones.
BMW Agents introduces a high-level multi-agent framework for automating a wide variety of enterprise tasks, addressing issues such as structuring communication within the agent network, implementing memory, and managing various stages of task execution.
🛠️ Useful stuff
Tools for building text-based agents abound, but tooling for multimodal tasks such as computer vision lags behind. Vision Agent allows users to describe a vision problem in natural language and easily generate code to solve it using multimodal LLMs.
Llama-agents is another framework for building multi-agent systems; it implements agents as services that receive and process tasks from a message queue.
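To illustrate the general pattern of agents as services consuming tasks from a queue, here is a generic asyncio sketch; it is illustrative only, not llama-agents' actual API, and the service names are made up.

```python
# Generic sketch of the "agents as services on a message queue" pattern.
# Illustrative asyncio code, not llama-agents' actual API.
import asyncio

async def agent_service(name: str, queue: asyncio.Queue, results: list) -> None:
    """A long-running service that pulls tasks from the queue and handles them."""
    while True:
        task = await queue.get()
        # A real service would call an LLM or tool here; we just echo the task.
        results.append(f"{name} handled: {task}")
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    results: list[str] = []
    workers = [
        asyncio.create_task(agent_service("research_agent", queue, results)),
        asyncio.create_task(agent_service("writer_agent", queue, results)),
    ]
    for task in ["summarize the paper", "draft the newsletter blurb"]:
        await queue.put(task)
    await queue.join()  # wait until every queued task has been processed
    for w in workers:
        w.cancel()      # shut the services down
    await asyncio.gather(*workers, return_exceptions=True)
    print("\n".join(results))

asyncio.run(main())
```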
MindsDB allows AI models to be built, served, and fine-tuned from enterprise data. It integrates with a variety of data sources and ML frameworks, and includes support for agent-based systems.
💡 Analysis
A piece by Harvard professor Jonathan Zittrain in The Atlantic raises concerns that AI agents could be turned to nefarious purposes, or even persist to wreak havoc on the internet after being abandoned by their human builders. He argues for proactive steps by regulators and the private sector to mitigate these risks.
Software engineering agents such as Devin have created significant excitement and controversy with their promise of automating code development. This article reviews some of the tasks that agents seem genuinely capable of taking over, as well as others where the hype exceeds the reality.
An engineer at the startup Octomind takes a critical look at the popular LLM orchestration framework LangChain, echoing many of the common complaints about it, in particular its tendency to make code more complex rather than simpler.