Microsoft introduces Windows Agent Arena
A new benchmark for Windows control agents

Welcome back to Building AI Agents, your weekly guide to everything new in the AI agent field!
AI agents are now on the verge of being used as podcast hosts. Would you tune into The Joe Robot Experience?
In today’s issue…
- Microsoft introduces a benchmark for Windows control agents
- Salesforce debuts its Agentforce platform
- CrewAI launches a new forum for agent developers
- The AI agents making an impact in enterprises now
- A study on the best LLMs for powering agents
- …and more
🔍 SPOTLIGHT

Source: Microsoft
In the 1990s, Microsoft Windows nearly took over the computing world. Now, Microsoft’s computer agents are trying to take over Windows.
Computer agents, or computer control agents, are a type of AI agent designed to accomplish tasks in a desktop environment, taking the computer’s screen as an input and outputting actions such as scrolling, clicking, and entering text. Such agents potentially have enormous economic value, promising to save PC users significant amounts of time by automating rote, low-skill tasks. As a result, they have attracted considerable interest from major AI players, including tech giants such as Google and OpenAI, scrappy startups like Adept, and academic institutions. In May, a collaboration between the University of Hong Kong, Carnegie Mellon University, Salesforce Research, and the University of Waterloo produced OSWorld, a benchmark of 369 computer control tasks across Linux, Windows, and macOS.
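To make that loop concrete, here is a minimal sketch of an observe-act cycle in Python. It assumes pyautogui for input control, mss for screen capture, and an OpenAI-style vision model as the policy; the prompt and action format are illustrative, not any particular product's implementation.

```python
# Hypothetical observe-act loop for a computer control agent.
# Assumes: pip install mss pyautogui openai, and an OPENAI_API_KEY.
import base64
import mss
import mss.tools
import pyautogui
from openai import OpenAI

client = OpenAI()

def screenshot_b64() -> str:
    """Capture the primary monitor and return it as a base64-encoded PNG."""
    with mss.mss() as sct:
        shot = sct.grab(sct.monitors[1])
        png = mss.tools.to_png(shot.rgb, shot.size)
    return base64.b64encode(png).decode()

def execute(action: str) -> None:
    """Carry out a simple action string such as 'click 512 384'."""
    verb, _, arg = action.partition(" ")
    if verb == "click":
        x, y = map(int, arg.split())
        pyautogui.click(x, y)
    elif verb == "type":
        pyautogui.write(arg)
    elif verb == "scroll":
        pyautogui.scroll(int(arg))  # positive scrolls up, negative down

def step(task: str) -> str:
    """Show the model the current screen and ask for the next action."""
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for any vision-capable model
        messages=[{"role": "user", "content": [
            {"type": "text", "text": f"Task: {task}\n"
             "Reply with exactly one action: click X Y | type TEXT | scroll N | done"},
            {"type": "image_url", "image_url":
                {"url": f"data:image/png;base64,{screenshot_b64()}"}},
        ]}],
    )
    return response.choices[0].message.content.strip()

task = "Open Notepad and type 'hello'"
for _ in range(20):  # cap the episode length
    action = step(task)
    if action == "done":
        break
    execute(action)
```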
Now, Microsoft is entering the arena (so to speak) with Windows Agent Arena, a benchmark that expands on OSWorld with a significant number of tasks for the Windows operating system. Windows Agent Arena consists of 154 challenges across the Office, Web Browsing, Windows System, Coding, Media & Video, and Windows Utilities domains, along with a virtual environment for evaluating computer control agents on them. In the paper introducing the benchmark, Microsoft’s researchers also debuted a Windows computer control agent called Navi, which successfully completed 19.5% of the benchmark’s tasks, still a far cry from the 74.5% success rate of an unassisted human. The benchmark is publicly available on GitHub and integrates with Microsoft’s Azure cloud so that researchers can run hundreds of agents in parallel, cutting evaluation times from days to minutes.
Windows Agent Arena is not Microsoft’s first foray into the computer agent space: in February, the company released UI-Focused Agent (UFO), a multi-agent framework for computer control tasks, and has continued to update it since. In contrast to competitors such as Apple and the Linux ecosystem, Microsoft appears to be betting that computer control agents will be a future selling point for its operating system, and the release of Windows Agent Arena further cements its lead.
If you find Building AI Agents valuable, forward this email to a friend or colleague!
📰 NEWS

Source: OpenAI
On Thursday, OpenAI released its much-hyped new AI system, OpenAI o1, touting its advanced reasoning capabilities. Rather than returning raw large language model (LLM) outputs, o1 operates as a simple agentic system, performing reflection-based reasoning on the backend before presenting its conclusions to users.
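OpenAI has not published o1’s internals, but the general reflection pattern is straightforward to sketch: draft an answer, critique it, revise, and only then show the user the result. Here is a minimal illustrative version using the standard OpenAI chat API; the model name, prompts, and number of rounds are placeholders, not o1’s actual design.

```python
# Illustrative draft-critique-revise loop, not OpenAI's implementation.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in model; o1 hides this loop behind one call
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def answer_with_reflection(question: str, rounds: int = 2) -> str:
    draft = ask(question)
    for _ in range(rounds):
        critique = ask(f"Question: {question}\nDraft answer: {draft}\n"
                       "List any errors or gaps in this draft.")
        draft = ask(f"Question: {question}\nDraft answer: {draft}\n"
                    f"Critique: {critique}\nWrite an improved answer.")
    return draft  # only the final answer is shown to the user

print(answer_with_reflection("How many prime numbers are there below 100?"))
```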
Salesforce has lately been making what CEO Marc Benioff calls a “hard pivot” to AI agents. Now, the company has officially launched its Agentforce platform, which lets enterprise customers build AI agents for sales, customer service, operations, and more.
AI agents have become a particularly hot topic in the cryptocurrency world, with proponents arguing for their potential as autonomous traders. Leading cryptocurrency exchange Coinbase announced that it would be releasing a Python software development kit (SDK) to facilitate agentic cryptocurrency transactions.
Enterprise cloud computing provider ServiceNow announced that new “Xanadu” customer service management and IT service management agents would be coming to its Now platform beginning in November.
🛠️ USEFUL STUFF

Source: LangChain
This free new class by LangChain teaches users how to build AI agents using the company’s LangGraph platform.
CrewAI, provider of the eponymous agent framework, has launched the CrewAI Developer Forum, a “space to discuss all things AI agents”.
LangChain is asking developers to complete a brief survey on their experiences building agents, offering a chance to win swag or a $250 gift card.
Adam Silverman, the co-founder and COO of Agency, provider of the AgentOps orchestration platform, provides a roundup of some of the most significant developments in the agent field over the past week.
LlamaIndex and Qdrant have teamed up to offer a webinar on building retrieval-augmented generation (RAG) agents with LlamaIndex using Qdrant’s vector database.
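For a taste of what such a build looks like, here is a minimal sketch of a RAG pipeline pairing LlamaIndex with a Qdrant vector store. It assumes the llama-index and qdrant-client packages (imports follow the llama-index 0.10+ layout) and an OpenAI API key for the default embedding model and LLM.

```python
# Minimal RAG pipeline: LlamaIndex for indexing/querying, Qdrant for storage.
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(location=":memory:")  # swap for a real server
vector_store = QdrantVectorStore(client=client, collection_name="docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./data").load_data()  # any folder of text files
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

query_engine = index.as_query_engine()
print(query_engine.query("What do these documents say about AI agents?"))
```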
💡 ANALYSIS

Source: ZDNET
The author of this piece interviews two Salesforce executives on the coming integration of agents into enterprises’ operations; they predict that some tasks will be automated immediately, but that fully replacing employees will take much longer.
This article focuses on the use of AI agents to transact in goods and services, including logistics and inventory management, DeFi and cryptocurrency trading, and contract negotiation.
Prior generations of autonomous systems could react to external events in pre-programmed ways but lacked human-like subjective judgment. This piece, the first in a series on the future of agents, argues that LLMs finally provide that capability and reviews a few of the verticals they are transforming.
🧪 RESEARCH

Source: Locusive
This study by customer service chatbot provider Locusive benchmarks the performance of many commonly used open- and closed-source LLMs as reasoning engines for the company’s copilot agent, finding that Llama 3.1 70B performs best across the three crucial areas of accuracy, speed, and cost.
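As a rough illustration of how such a comparison can be run, here is a sketch of a harness that scores one model on the same three axes. The test cases, model name, and per-token price are placeholders, not Locusive’s actual methodology.

```python
# Toy benchmark harness: accuracy, average latency, and estimated cost.
import time
from openai import OpenAI

client = OpenAI()

CASES = [("What is 17 * 23? Answer with just the number.", "391")]
PRICE_PER_1K_TOKENS = {"gpt-4o": 0.005}  # illustrative blended price, USD

def evaluate(model: str) -> dict:
    correct, seconds, cost = 0, 0.0, 0.0
    for prompt, expected in CASES:
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        seconds += time.perf_counter() - start
        if expected in response.choices[0].message.content:
            correct += 1
        cost += response.usage.total_tokens / 1000 * PRICE_PER_1K_TOKENS[model]
    return {"accuracy": correct / len(CASES),
            "avg_latency_s": seconds / len(CASES),
            "est_cost_usd": cost}

print(evaluate("gpt-4o"))
```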
Sakana’s hyped and controversial AI Scientist was the most high-profile of many recent attempts to develop AI agents capable of automating scientific research. The authors of this paper recruited 100 natural language processing (NLP) researchers to write original research proposals and had blinded human raters compare them to proposals generated by LLM agents, finding that the agent-generated ideas were rated as more novel but less feasible.
This paper introduces DSBench, a benchmark of 466 data science tasks and 74 data modeling tasks, and finds that the best-performing agent successfully solves only 34% of them.
Thanks for reading! Until next time, keep learning and building!
If you have any specific feedback, just reply to this email; we’d love to hear from you!
Follow us on X (Twitter) and LinkedIn