Microsoft introduces Windows Agent Arena
A new benchmark for Windows control agents

Welcome back to Building AI Agents, your weekly guide to everything new in the AI agent field!
AI agents are now on the verge of being used as podcast hosts. Would you tune into The Joe Robot Experience?
In today’s issue…
- Microsoft introduces a benchmark for Windows control agents
- Salesforce debuts its Agentforce platform
- CrewAI launches a new forum for agent developers
- The AI agents making an impact in enterprises now
- A study on the best LLMs for powering agents
- …and more
🔍 SPOTLIGHT

Source: Microsoft
In the 1990s, Microsoft Windows nearly took over the computing world. Now, Microsoft’s computer agents are trying to take over Windows.
Computer agents, or computer control agents, are a type of AI agent designed to accomplish tasks in a desktop environment, taking the computer’s screen as an input and outputting actions such as scrolling, clicking, and entering text. Such agents potentially have enormous economic value, promising to save PC users significant amounts of time by automating rote, low-skill tasks. As a result, they have attracted considerable interest from major AI players, including tech giants such as Google and OpenAI, scrappy startups like Adept, and academic institutions. In May, a collaboration between the University of Hong Kong, Carnegie Mellon University, Salesforce Research, and the University of Waterloo produced OSWorld, a benchmark of 369 computer control tasks across Linux, Windows, and macOS.
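To make that loop concrete, here is a minimal sketch of an observe-act cycle in Python. It assumes pyautogui for input control, mss for screen capture, and an OpenAI-style vision model as the policy; the prompt and action format are illustrative, not any particular product's implementation.

```python
# Hypothetical observe-act loop for a computer control agent.
# Assumes: pip install mss pyautogui openai, and an OPENAI_API_KEY.
import base64
import mss
import mss.tools
import pyautogui
from openai import OpenAI

client = OpenAI()

def screenshot_b64() -> str:
    """Capture the primary monitor and return it as a base64-encoded PNG."""
    with mss.mss() as sct:
        shot = sct.grab(sct.monitors[1])
        png = mss.tools.to_png(shot.rgb, shot.size)
    return base64.b64encode(png).decode()

def execute(action: str) -> None:
    """Carry out a simple action string such as 'click 512 384'."""
    verb, _, arg = action.partition(" ")
    if verb == "click":
        x, y = map(int, arg.split())
        pyautogui.click(x, y)
    elif verb == "type":
        pyautogui.write(arg)
    elif verb == "scroll":
        pyautogui.scroll(int(arg))  # positive scrolls up, negative down

def step(task: str) -> str:
    """Show the model the current screen and ask for the next action."""
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for any vision-capable model
        messages=[{"role": "user", "content": [
            {"type": "text", "text": f"Task: {task}\n"
             "Reply with exactly one action: click X Y | type TEXT | scroll N | done"},
            {"type": "image_url", "image_url":
                {"url": f"data:image/png;base64,{screenshot_b64()}"}},
        ]}],
    )
    return response.choices[0].message.content.strip()

task = "Open Notepad and type 'hello'"
for _ in range(20):  # cap the episode length
    action = step(task)
    if action == "done":
        break
    execute(action)
```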
Now, Microsoft is entering the arena (so to speak) with Windows Agent Arena, a benchmark that expands on OSWorld with a significant number of tasks for the Windows operating system. Windows Agent Arena consists of 154 challenges across the Office, Web Browsing, Windows System, Coding, Media & Video, and Windows Utilities domains, along with a virtual environment for evaluating computer control agents on them. In the paper introducing the benchmark, Microsoft’s researchers also debuted a Windows computer control agent called Navi, which successfully completed 19.5% of the benchmark’s tasks, still a far cry from the 74.5% success rate of an unassisted human. The benchmark is publicly available on GitHub and integrates with Microsoft’s Azure cloud so that researchers can run hundreds of agents in parallel, cutting evaluation times from days to minutes.
Windows Agent Arena is not Microsoft’s first foray into the computer agent space: in February, the company released UI-Focused Agent (UFO), a multi-agent framework for computer control tasks, and has continued to update it since. In contrast to competitors such as Apple and the Linux ecosystem, Microsoft appears to be betting that computer control agents will be a future selling point for its operating system, and the release of Windows Agent Arena further cements its lead.
If you find Building AI Agents valuable, forward this email to a friend or colleague!
📰 NEWS

Source: OpenAI
On Thursday, OpenAI released its much-hyped new AI system, OpenAI o1, touting its advanced reasoning capabilities. Rather than returning raw large language model (LLM) outputs, o1 operates as a simple agentic system, performing reflection-based reasoning on the backend before presenting its conclusions to users.
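OpenAI has not published o1’s internals, but the general reflection pattern is straightforward to sketch: draft an answer, critique it, revise, and only then show the user the result. Here is a minimal illustrative version using the standard OpenAI chat API; the model name, prompts, and number of rounds are placeholders, not o1’s actual design.

```python
# Illustrative draft-critique-revise loop, not OpenAI's implementation.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in model; o1 hides this loop behind one call
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def answer_with_reflection(question: str, rounds: int = 2) -> str:
    draft = ask(question)
    for _ in range(rounds):
        critique = ask(f"Question: {question}\nDraft answer: {draft}\n"
                       "List any errors or gaps in this draft.")
        draft = ask(f"Question: {question}\nDraft answer: {draft}\n"
                    f"Critique: {critique}\nWrite an improved answer.")
    return draft  # only the final answer is shown to the user

print(answer_with_reflection("How many prime numbers are there below 100?"))
```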
Salesforce has lately been making what CEO Marc Benioff calls a “hard pivot” to AI agents. Now, the company has officially launched its Agentforce platform, which lets enterprise customers build AI agents for sales, customer service, operations, and more.
AI agents have become a particularly hot topic in the cryptocurrency world, with proponents arguing for their potential as autonomous traders. Leading cryptocurrency exchange Coinbase announced that it would be releasing a Python software development kit (SDK) to facilitate agentic cryptocurrency transactions.
Enterprise cloud computing provider ServiceNow announced that new “Xanadu” customer service management and IT service management agents would be coming to its Now platform beginning in November.
🛠️ USEFUL STUFF

Source: LangChain
This free new class by LangChain teaches users how to build AI agents using the company’s LangGraph platform.
CrewAI, provider of the eponymous agent framework, has launched the CrewAI Developer Forum, a “space to discuss all things AI agents”.
LangChain is asking developers to complete a brief survey on their experiences building agents, offering a chance to win swag or a $250 gift card.
Adam Silverman, the co-founder and COO of Agency, provider of the AgentOps orchestration platform, provides a roundup of some of the most significant developments in the agent field over the past week.
LlamaIndex and Qdrant have teamed up to offer a webinar on building retrieval-augmented generation (RAG) agents with LlamaIndex using Qdrant’s vector database.
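For a taste of what such a build looks like, here is a minimal sketch of a RAG pipeline pairing LlamaIndex with a Qdrant vector store. It assumes the llama-index and qdrant-client packages (imports follow the llama-index 0.10+ layout) and an OpenAI API key for the default embedding model and LLM.

```python
# Minimal RAG pipeline: LlamaIndex for indexing/querying, Qdrant for storage.
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(location=":memory:")  # swap for a real server
vector_store = QdrantVectorStore(client=client, collection_name="docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./data").load_data()  # any folder of text files
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

query_engine = index.as_query_engine()
print(query_engine.query("What do these documents say about AI agents?"))
```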
💡 ANALYSIS

Source: ZDNET
The author of this piece interviews two Salesforce executives on the coming integration of agents into enterprises’ operations; they predict that some tasks will be automated immediately, but that fully replacing employees will take much longer.
This article focuses on the use of AI agents to transact in goods and services, including logistics and inventory management, DeFi and cryptocurrency trading, and contract negotiation.
Prior generations of autonomous systems could react to external events in pre-programmed ways but lacked human-like subjective judgment. This piece, the first in a series on the future of agents, argues that LLMs finally provide that capability and reviews a few of the verticals they are transforming.
🧪 RESEARCH

Source: Locusive
This study by customer service chatbot provider Locusive benchmarks the performance of many commonly used open- and closed-source LLMs as reasoning engines for the company’s copilot agent, finding that Llama 3.1 70B performs best across the three crucial areas of accuracy, speed, and cost.
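As a rough illustration of how such a comparison can be run, here is a sketch of a harness that scores one model on the same three axes. The test cases, model name, and per-token price are placeholders, not Locusive’s actual methodology.

```python
# Toy benchmark harness: accuracy, average latency, and estimated cost.
import time
from openai import OpenAI

client = OpenAI()

CASES = [("What is 17 * 23? Answer with just the number.", "391")]
PRICE_PER_1K_TOKENS = {"gpt-4o": 0.005}  # illustrative blended price, USD

def evaluate(model: str) -> dict:
    correct, seconds, cost = 0, 0.0, 0.0
    for prompt, expected in CASES:
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        seconds += time.perf_counter() - start
        if expected in response.choices[0].message.content:
            correct += 1
        cost += response.usage.total_tokens / 1000 * PRICE_PER_1K_TOKENS[model]
    return {"accuracy": correct / len(CASES),
            "avg_latency_s": seconds / len(CASES),
            "est_cost_usd": cost}

print(evaluate("gpt-4o"))
```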
Sakana’s hyped and controversial AI Scientist was the most high-profile of many recent attempts to develop AI agents capable of automating scientific research. The authors of this paper recruited 100 natural language processing (NLP) researchers to write original research proposals and had blinded human raters compare them to proposals generated by LLM agents, finding that the agent-generated ideas were rated as more novel but less feasible.
This paper introduces DSBench, a benchmark of 466 data science tasks and 74 data modeling tasks, and finds that the best-performing agent successfully solves only 34% of them.
Thanks for reading! Until next time, keep learning and building!
If you have any specific feedback, just reply to this email; we’d love to hear from you!
Follow us on X (Twitter) and LinkedIn