A new benchmark for software engineering agents
ML-Bench tests agents' abilities to reason across an entire code repository

🔍 Spotlight
Software engineering is one of the most promising use cases for large language model (LLM) agents, and efforts to build “AI software engineers” have attracted both hype and controversy. Projects such as Cognition Labs’ Devin AI and its open-source imitators OpenDevin, Devika, and SWE-agent have generated excitement over their potential alongside concern about the threat they could pose to human software engineers’ jobs.
A more prosaic question, however, is what these agents can actually do. LLMs now regularly achieve impressive results on the HumanEval coding benchmark, which consists of relatively simple, self-contained problems, but to truly compete with human programmers, coding agents need to work with entire code repositories—tracking down bugs, writing tests, creating, moving, and deleting files, and more. Rigorous benchmarks for measuring these capabilities have so far been lacking.
A consortium of researchers at Yale, Nanjing, and Peking universities recently released ML-Bench, a collection of challenges requiring agents to make use of entire codebases, drawn from 18 notable machine learning libraries on GitHub. ML-Bench consists of two main components—ML-LLM-Bench and ML-Agent-Bench. While the former requires only that an LLM generate executable code based on an instruction, ML-Agent-Bench involves end-to-end task execution, including setting up environments, navigating repositories, installing dependencies, and executing commands in sequence.
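To make the distinction between the two tracks concrete, here is a minimal sketch of what the two evaluation modes might look like. The `Task` fields, prompts, and pass criteria below are illustrative assumptions, not ML-Bench’s actual schema or harness:

```python
"""Hypothetical sketch of ML-Bench's two evaluation modes.
Field names, prompts, and pass criteria are illustrative, not the real harness."""

import subprocess
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    repo_path: str    # local checkout of the GitHub repository
    instruction: str  # natural-language task description


def eval_ml_llm_bench(generate: Callable[[str], str], task: Task) -> bool:
    """ML-LLM-Bench: the model only needs to emit one runnable command."""
    command = generate(f"Repo: {task.repo_path}\nTask: {task.instruction}")
    result = subprocess.run(command, shell=True, cwd=task.repo_path)
    return result.returncode == 0


def eval_ml_agent_bench(agent_step: Callable[[str], str], task: Task,
                        max_steps: int = 20) -> bool:
    """ML-Agent-Bench: the agent acts end to end, issuing a sequence of
    commands (environment setup, navigation, installation, execution)."""
    observation = task.instruction
    for _ in range(max_steps):
        command = agent_step(observation)
        if command == "DONE":  # agent signals it believes the task is complete
            return True
        result = subprocess.run(command, shell=True, cwd=task.repo_path,
                                capture_output=True, text=True)
        observation = result.stdout + result.stderr
    return False
```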
The authors tested several open-source software development agents, finding that OpenDevin performed the best, solving over 76% of the ML-Agent-Bench challenges at an average cost of only 25 cents. However, they noted that even the best systems struggled with hallucination and selection of the correct language, highlighting challenges to be overcome as the tasks given to agents become more and more complex.
The code to implement ML-Bench is available on GitHub. The benchmark runs inside a Docker container on a secured Linux environment, allowing the agents under test to execute arbitrary actions without putting the evaluator’s machine at risk. ML-Bench represents an important evaluation tool in the effort to develop AI agents that can perform software engineering at the human level.
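For a sense of what this isolation can look like in practice, here is a rough sketch of launching an agent-issued command in a throwaway Docker container from Python. The image name, resource limits, and mount layout are placeholders, not ML-Bench’s actual configuration:

```python
"""Rough sketch of containerized command execution for an agent benchmark.
Image, limits, and mounts are placeholders, not ML-Bench's real setup."""

import subprocess


def run_sandboxed(command: str, repo_dir: str) -> subprocess.CompletedProcess:
    docker_cmd = [
        "docker", "run", "--rm",
        "--memory", "4g",                    # cap memory usage
        "--cpus", "2",                       # cap CPU usage
        "-v", f"{repo_dir}:/workspace:rw",   # only the repo checkout is mounted
        "-w", "/workspace",
        "python:3.11-slim",                  # placeholder base image
        "bash", "-c", command,
    ]
    # the container is discarded after each command, so destructive actions
    # never touch the evaluator's own filesystem
    return subprocess.run(docker_cmd, capture_output=True, text=True, timeout=600)
```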
📰 News
Databricks announced a slew of new tools at its Data + AI Summit, including Mosaic AI Agent Framework and Mosaic AI Agent Evaluation. These new capabilities respectively allow users to build and evaluate agent systems in Databricks’ Mosaic AI platform.
The London-based startup Zeta Labs unveiled Jace, an LLM agent targeted towards consumers and small businesses. The agent, which will cost $45/month once publicly available, will allow users to automatically execute complex tasks in a web browser.
Robotic process automation (RPA) company Automation Anywhere announced AI Agent Studio, a low-code suite for building agentic workflows similar to Langflow or Flowise.
🧪 Research
GuardAgent is a novel agent designed to protect other agents against unsafe behavior by writing code that can check their inputs and outputs against user-defined safety and privacy guidelines.
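As a toy illustration of the approach, the snippet below shows the kind of check code a guard might generate for a hypothetical privacy guideline. The guideline, regex, and function names are invented for illustration and are not taken from the GuardAgent paper:

```python
"""Toy example of generated guard code for a hypothetical guideline:
"never reveal a user's email address". Not GuardAgent's actual output."""

import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def check_output(agent_output: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a candidate response from the guarded agent."""
    if EMAIL_RE.search(agent_output):
        return False, "Output contains an email address, violating the privacy guideline."
    return True, "OK"


allowed, reason = check_output("Contact the user at jane.doe@example.com")
print(allowed, reason)  # False, with the privacy-violation reason
```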
Automated control of mobile devices is a field of growing interest in the AI agents space, but existing benchmarks fall short on scalability, robustness, and realism. MobileAgentBench addresses these drawbacks with a new set of 100 tasks across 10 everyday Android apps.
This paper by Google researchers proposes the Personal Health Insights Agent (PHIA), a system which brings together data from a user’s wearable devices, web information, and code generation capabilities to provide personalized answers to health questions.
The latest effort to allow LLMs to interact with computer GUIs using only visual data rather than text elements, CAAP achieves a high success rate of over 94% on the MiniWoB++ benchmark of browser-based tasks.
The authors of this paper propose “mixture-of-agents”, an analog to mixture-of-experts which solves queries by passing them through several “layers” of LLM calls which refine the responses. Mixture-of-agents leads to improved performance on several benchmarks of text-based reasoning, although it is questionable whether it can be considered truly “agentic”, as it does not involve active, multi-step task solving.
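The sketch below illustrates the layered refine-then-aggregate pattern in miniature; the prompt wording and the `mixture_of_agents` helper are illustrative assumptions, not the paper’s implementation:

```python
"""Minimal sketch of the mixture-of-agents idea: each "layer" of models sees the
previous layer's answers alongside the original query and produces a refined
answer, and a final model aggregates. Each model is any text-in, text-out callable."""

from typing import Callable


def mixture_of_agents(query: str,
                      layers: list[list[Callable[[str], str]]],
                      aggregator: Callable[[str], str]) -> str:
    answers: list[str] = []
    for layer in layers:
        if answers:
            context = "\n\n".join(
                f"Previous answer {i + 1}: {a}" for i, a in enumerate(answers)
            )
            prompt = f"{context}\n\nQuestion: {query}\nGive an improved answer."
        else:
            prompt = f"Question: {query}"
        # every model in the layer answers, conditioned on the prior layer's outputs
        answers = [model(prompt) for model in layer]
    final_prompt = (
        "Synthesize the best single answer to the question below.\n\n"
        + "\n\n".join(answers)
        + f"\n\nQuestion: {query}"
    )
    return aggregator(final_prompt)
```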
🛠️ Useful stuff
WebLLM is an engine for running LLMs in-browser using local GPU acceleration. It aims to enable autonomous agents that execute online tasks for users, such as booking calendars, drafting emails, and buying tickets.
The latest in a series of agent courses from Andrew Ng’s learning platform DeepLearning.AI, this one focuses on using Azure OpenAI and LangChain to build an agent that interacts with SQL databases.
💡 Analysis
In the latest issue of his newsletter The Batch, Andrew Ng addresses the contentious question—what makes a system an “agent”?
Researchers at Harvard’s Kennedy School of public policy propose using LLM agents to simulate responses to political polls. By accessing up-to-date web information, agents can reflect real respondents’ likely opinions more accurately than vanilla LLMs, whose knowledge is frozen at training time.
The author of this piece argues that anthropomorphizing even sophisticated AI agents is a mistake that can lead to unnecessary hype or panic.