Can agents cut labor costs by 97%?

LLM agents are faster and cheaper than humans on software tasks, Mistral enters the agent game, and more

🔍 Spotlight

Much of the excitement around AI agents focuses on their promise of performing humanlike tasks at high speed and at a fraction of the cost of human labor. According to a new analysis, that vision may now be within reach.

Model Evaluation and Threat Research (METR) is a non-profit dedicated to assessing the potential risks that advanced AI systems pose to human society; it has collaborated with major players in the AI space such as Anthropic, OpenAI, and the UK AI Safety Institute. While the doomsday scenarios METR exists to guard against seem far-fetched with today’s systems, a new report from the organization aims to quantify how well agents built on the most sophisticated large language models (LLMs) available can perform human tasks. Such agents could be early precursors of the kind of dangerously advanced systems AI safety advocates fear.

To assess agents’ capabilities, METR researchers assembled a benchmark of tasks related to cybersecurity, software engineering, and machine learning, and gathered human baselines using testers with STEM degrees and technical work experience. These jobs took humans anywhere from less than 5 minutes to over 16 hours on average. The agents METR challenged with these tasks were based on two of the most advanced LLMs available today: OpenAI’s GPT-4o and Anthropic’s Claude 3.5.

The results were striking: the agents succeeded more often than they failed on assignments that required humans up to 30 minutes to complete, often in a tiny fraction of the time. Because the price of human labor is determined by work time while the price of AI agents is determined by token use, the researchers converted the cost of the agents’ LLM calls into the average hourly wage of a bachelor’s degree holder in the US, finding that the agents were fully 97% cheaper on average. Tasks that took a human up to 2 hours could often be performed by an agent for less than $2.
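The cost comparison above can be sketched as a back-of-the-envelope calculation. The numbers below are hypothetical stand-ins, not METR’s actual figures: an illustrative $36/hour wage, a 150,000-token agent run, and a $10-per-million-token price.

```python
# Illustrative comparison of time-priced human labor vs. token-priced agents.
# All numeric inputs here are hypothetical, for illustration only.

def agent_cost(tokens_used: int, price_per_million_tokens: float) -> float:
    """Cost of an agent run, priced by token consumption."""
    return tokens_used / 1_000_000 * price_per_million_tokens

def human_cost(hours_worked: float, hourly_wage: float) -> float:
    """Cost of human labor, priced by work time."""
    return hours_worked * hourly_wage

# A 2-hour human task at $36/hour vs. an agent using 150,000 tokens
# at $10 per million tokens.
human = human_cost(2.0, 36.0)      # $72.00
agent = agent_cost(150_000, 10.0)  # $1.50
savings = 1 - agent / human        # ~0.98, i.e. roughly 97-98% cheaper
```

The point of the sketch is the pricing asymmetry: doubling human task time doubles the labor cost, while the agent’s cost scales only with tokens consumed.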

The news was not all good for agents. Many of the jobs were beyond their ability to accomplish, with additional token usage increasing the completion rate only up to a budget of around 200,000 tokens; beyond this, the agents’ performance plateaued. The report’s conclusion, in brief, is that for the set of tasks agents are capable of performing, they vastly outperform humans on speed and cost. That said, there is good reason to believe this set will expand quickly as more intelligent LLMs become available, including ones purpose-built for agentic use cases, and as agent architectures improve.

METR’s results provide a striking window into a future of rapid automation of human cognitive labor, and they will undoubtedly fuel debate about its implications for human society, both positive and negative.

📰 News

Every day seems to bring a new batch of agent startups emerging from stealth with millions of dollars in venture funding. This X post lists some of the most significant ones, their sector, and how much they raised.

Bardeen announced that it had secured additional funding to develop its enterprise agent platform. Unlike many agent companies, Bardeen aims to build general-purpose agents integrated with over 100 common business programs such as Microsoft’s 365 suite.

French LLM provider Mistral has released a simple codeless solution to build “agents” with its platform, though these seem to consist of one-step, few-shot LLM prompts rather than sophisticated systems.

🧪 Research

One potential risk of multi-agent systems is the possibility that malicious actors will introduce deliberately dysfunctional agents designed to sabotage their functioning. This paper tests the viability of this attack, finding that some agent systems are more impacted than others.

One of the great debates in the study of cognition is between symbolism and connectionism, with the latter being given a strong boost by the recent success of LLMs. The authors of this paper argue that LLM-based agents represent a synthesis of the two approaches, with the language model providing the connectionist component and the language it produces performing the symbolic operations.

Human organizations accomplish complex tasks by decomposing them into subtasks and dividing them between their members, who may, in turn, delegate sub-subtasks to subordinates, and so forth. ReDel is a new toolkit which gives LLM agents the ability to do the same, recursively spawning subordinate agents to whom subtasks are delegated. This approach provides a performance increase on several complex planning benchmarks.
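The recursive-delegation pattern described above can be illustrated with a minimal sketch. This is not ReDel’s actual API: the real toolkit uses LLM agents to plan and solve, whereas the `decompose` and `run_agent` functions below are hypothetical stand-ins that make the control flow concrete and runnable.

```python
# A minimal sketch of recursive task delegation in the spirit of ReDel.
# `decompose` stands in for an LLM planner; leaf tasks are "solved" directly.

def decompose(task: str) -> list[str]:
    """Split a compound task into subtasks (stand-in for an LLM planner)."""
    return [t.strip() for t in task.split(" and ")] if " and " in task else []

def run_agent(task: str, depth: int = 0, max_depth: int = 3) -> str:
    """Solve a task directly, or spawn subordinate agents for its subtasks."""
    subtasks = decompose(task) if depth < max_depth else []
    if not subtasks:
        return f"done: {task}"          # leaf agent handles the task itself
    results = [run_agent(t, depth + 1, max_depth) for t in subtasks]
    return "; ".join(results)           # parent agent aggregates the results

print(run_agent("write tests and fix bugs and update docs"))
# → done: write tests; done: fix bugs; done: update docs
```

The `max_depth` cap mirrors a practical concern with recursive spawning: without a depth limit, a delegating agent could keep subdividing work indefinitely.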

Effective software engineering often requires a holistic understanding of how an entire code repository is structured and connected, which has been a challenge for software engineering agents. This paper introduces CodexGraph, which addresses this difficulty using a structure graph extracted from the code repository to guide the agents’ work.

🛠️ Useful stuff

Many of the struggles that today’s LLMs have with agentic tasks can be attributed to their lack of specialized fine-tuning on these roles. Agent startup Orby has released ActIO, an LLM purpose-built for automating enterprise tasks by incorporating capabilities such as visual grounding and long-term planning.

This article is the first in a series of four by LangChain on different types of user experiences (UX), or interfaces, which can be used to interact with AI agents.

LangGraph Engineer is the prototype of an agent which can translate a description of an agentic system by the user into a scaffold in LangGraph—in essence, employing an agent to help the user build other agents.

Lyzr Automata, created by startup Lyzr, is the newest of many low-code frameworks for building multi-agent systems.

💡 Analysis

A podcast conversation hosted by VentureBeat on the promise and challenges of AI agents for enterprise use cases.

An X thread on the future of software engineering and its downstream industries as cheap, effective software engineering agents become available.