Can AI run a company?
Plus: 10 cool projects with Google’s Agent Development Kit, the 4 types of people interested in AI agents, and more

Welcome back to Building AI Agents, your biweekly guide to everything new in the field of agentic AI!
Web search functionality has, in a way, made LLMs worse to use. "That's a great question. I'm a superintelligence but let me just check with some SEO articles to be sure."
— John Collison (@collision)
10:41 PM • Apr 23, 2025
You’re not really augmenting an LLM with web search if all the articles it searches were themselves written by LLMs.
REFERRAL BONUS: 🕙💰 For a limited time, our AI agent implementation agency, Orynt AI, is offering a 10% bonus to readers who help small and midsize businesses find our services.
Refer a customer who signs on with us to build an agent for them, and we’ll pay you 10% of the total value of the contract!
In today’s issue…
Testing agent capabilities with an all-AI company
Anthropic says AI employees are a year away
10 cool projects with Google’s Agent Development Kit
The 4 types of people interested in AI agents
IBM claims $3.5 billion productivity boost with agents
…and more
🔍 SPOTLIGHT

Recently, I wrote about the prospect of billion-dollar companies with just one employee. “What about zero?” asks a group of researchers.
In a new paper, a team at Carnegie Mellon University reveals TheAgentCompany, an effort to assess the ability of LLM-powered agents to automate every aspect of a business’s operations. The study simulates a small software company whose AI agent “employees” communicate and work through open-source counterparts to familiar enterprise software: GitLab in place of GitHub, OwnCloud in place of Microsoft Office, RocketChat in place of Slack, and so forth. The 175 tasks the agents are given, such as “arrange a meeting room for the quarterly review” and “analyze a spreadsheet of user metrics and produce a summary report,” span software engineering, finance, project management, and more.
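For a concrete flavor of how a benchmark like this can be wired up, here is a minimal, hypothetical sketch in which a task is a natural-language prompt plus programmatic checkpoints scored against stub services. Every name in it (Task, Checkpoint, the stub clients) is my own invention for illustration; the paper’s actual harness is more elaborate and talks to live instances of these tools.

```python
# Hypothetical sketch of a TheAgentCompany-style task definition.
# All names here are invented for illustration, not the paper's harness.
from dataclasses import dataclass, field
from typing import Callable, List


class StubOwnCloud:
    """Stands in for the benchmark's OwnCloud file-sharing instance."""
    def __init__(self) -> None:
        self.files: set = set()

    def file_exists(self, path: str) -> bool:
        return path in self.files


class StubRocketChat:
    """Stands in for the benchmark's RocketChat messaging instance."""
    def __init__(self) -> None:
        self.messages: list = []

    def message_sent_to(self, user: str) -> bool:
        return any(m["to"] == user for m in self.messages)


@dataclass
class Checkpoint:
    description: str
    passed: Callable[[], bool]  # queried against the services after the run


@dataclass
class Task:
    prompt: str
    checkpoints: List[Checkpoint] = field(default_factory=list)

    def score(self) -> float:
        """Fraction of checkpoints satisfied by the agent's actions."""
        results = [cp.passed() for cp in self.checkpoints]
        return sum(results) / len(results) if results else 0.0


owncloud, chat = StubOwnCloud(), StubRocketChat()

task = Task(
    prompt="Analyze a spreadsheet of user metrics and produce a summary report.",
    checkpoints=[
        Checkpoint("report uploaded to OwnCloud",
                   lambda: owncloud.file_exists("/reports/summary.md")),
        Checkpoint("manager notified in RocketChat",
                   lambda: chat.message_sent_to("manager")),
    ],
)

# Simulate an agent that writes the report but forgets to tell anyone:
owncloud.files.add("/reports/summary.md")
print(task.score())  # 0.5, i.e. partial completion
```

Checking a list of checkpoints rather than a single pass/fail condition is what lets a harness like this hand out partial credit when an agent gets partway through a task, instead of scoring every near-miss as a zero.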
So how did they do? In short, results were mixed. The best model, Anthropic’s Claude 3.5 Sonnet, successfully completed only 24% of the tasks, with Google’s Gemini 2.0 Flash (11.4%), OpenAI’s GPT-4o (8.6%), and Meta’s open-source Llama 3.1 405B (7.4%) performing worse. Success rates also varied by role: Claude 3.5 Sonnet completed nearly 36% of project management tasks but just over 8% of finance tasks.
Some articles reporting on the study were quick to ridicule the agents’ mediocre performance, especially one failure in which an agent, tasked with asking another user in the company chat a question, couldn’t find them and “solved” the problem by renaming a different user to the intended recipient’s name.
However, these sneering takes miss the forest for the trees. First, the study clearly showed that agent performance increases by leaps and bounds as models improve. For instance, while Llama 3.1 70B, released in July of last year, achieved a success rate of only 1.7%, its December cousin Llama 3.3 70B, a model of the same size, clocked in at 6.9%: a four-fold improvement in just five months. Qwen managed an even more impressive five-fold increase between models released three months apart. With Claude 3.5 Sonnet already handling a quarter of tasks successfully, it doesn’t take a mathematician to recognize that a surprisingly high success rate could be just around the corner.
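As a back-of-the-envelope illustration (my own arithmetic on the scores above, not a figure from the paper), the Llama jump implies a compounding improvement of roughly 30% per month:

```python
# Back-of-the-envelope: implied monthly growth from Llama 3.1 70B (1.7%,
# July) to Llama 3.3 70B (6.9%, December). Illustrative only; real
# progress is lumpy and won't follow a smooth exponential.
july_score, december_score = 1.7, 6.9
months = 5

monthly_multiplier = (december_score / july_score) ** (1 / months)
print(f"Total improvement: {december_score / july_score:.1f}x")    # ~4.1x
print(f"Implied monthly multiplier: {monthly_multiplier:.2f}x")    # ~1.32x
```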
The purpose of the study, indeed, was not to demonstrate that LLMs could fully automate an entire company today, but to establish a benchmark that could be used to track where we are along that road. Once a “difficult” benchmark is established, it doesn’t tend to stay difficult for long. Humanity’s Last Exam, built by a massive coalition of researchers to contain the most difficult academic questions an AI could be confronted with, debuted in January with the best LLM (OpenAI o1) achieving only a 13% success rate—OpenAI’s Deep Research based on o3 doubled that score just 9 days later. ARC-AGI, deliberately designed to capture the kind of pattern recognition humans excel at but AI struggles with, saw little progress until 2024, when OpenAI’s model performance on it soared from o1’s 13.3% to o3’s 87.5%. And let’s not forget the Turing Test.
Of course, a benchmark being beaten does not necessarily mean the underlying task will be immediately automated. Benchmarks capture performance under idealized lab conditions, not the messy ones of reality. In the short term, Jensen Huang’s vision of a hybrid company in which humans manage armies of AI agents is far more achievable than the fully autonomous enterprises implied by TheAgentCompany.
The study, rather, illuminates the road we are following towards an automated economy. We aren’t there yet, and the road is long—but our vehicle is moving very, very fast.
🤖 AI AGENTS FOR YOUR BUSINESS

Orynt AI: The official implementation agency of Building AI Agents
🕙💰 LIMITED TIME ONLY: Refer a future customer to us and get 10% of the total value of the contract as a bonus!
Discover exactly where AI can streamline your operations, saving you time, reducing overhead, and improving consistency.
Automate routine tasks with agents tailored specifically to your business needs.
Unlock serious growth and efficiency by scheduling your free 30-minute strategy call.