Text vs. visual agents—which will win?

Plus: 10 questions that tell you if you're ready for AI agents, a template for building agents with Google's latest model, and more

Welcome back to Building AI Agents, your biweekly guide to everything new in the field of agentic AI!


In today’s issue…

  • The two types of AI agents, and which will prevail

  • Databricks’ new platform builds agents with just a description

  • 🎉 7 days to our course and community launch

  • 10 questions that tell you if you’re ready for agents

  • What agent saves you the most time each week?

…and more

🔍 SPOTLIGHT

OpenAI’s Operator computer use agent in action | Source: OpenAI

It takes vision to see the future, and vision isn’t something all AI agents have.

When we think of agents, we usually think of one particular type: text agents. These came first, because they were the most natural application of large language models (LLMs). Give a language model a task description in natural language and some relevant tools that it can invoke by outputting a special sequence of characters, and voila, you have a text agent.

As the agent carries out its job, its task description, inputs and outputs from tools, thought process, and everything else are represented as a sequence of text that is fed into the LLM’s context window and gradually added on to. For an LLM, this paradigm works perfectly—it’s exactly how LLMs “see” the world.
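This loop is simple enough to sketch in a few lines. In the sketch below, `fake_llm`, the `TOOL_CALL`/`FINAL` markers, and the tool registry are all illustrative stand-ins, not any particular vendor's API; a real agent would replace `fake_llm` with a call to an actual model.

```python
import json

# Illustrative tool registry: the agent invokes these by name.
TOOLS = {
    "add": lambda a, b: a + b,
}

def fake_llm(context: str) -> str:
    """Stand-in for a real LLM call. Emits a tool call the first
    time, then a final answer once the tool result is in context."""
    if "TOOL_RESULT" not in context:
        return 'TOOL_CALL {"name": "add", "args": [2, 3]}'
    return "FINAL The sum is 5."

def run_text_agent(task: str, max_steps: int = 5) -> str:
    # The whole history lives in one growing string of text --
    # exactly the context-window paradigm described above.
    context = f"TASK: {task}\n"
    for _ in range(max_steps):
        output = fake_llm(context)
        context += output + "\n"
        if output.startswith("TOOL_CALL"):
            call = json.loads(output[len("TOOL_CALL"):])
            result = TOOLS[call["name"]](*call["args"])
            context += f"TOOL_RESULT {result}\n"  # feed result back in
        elif output.startswith("FINAL"):
            return output[len("FINAL"):].strip()
    return "Agent stopped without an answer."
```

Notice that everything—the task, the tool outputs, the model's replies—is just text appended to one string, which is why this paradigm fits LLMs so naturally.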

But it is also their main drawback: the world of computer operating systems and web pages is set up for humans, not LLMs, and humans don’t just perceive the world through endless strings of text, we perceive it visuospatially through images. The very first computers may have had purely textual interfaces, but since the 1980s, virtually all of them have used graphical ones (GUIs)—and so does the modern internet. Try setting up a web page with nothing but rows of text and see how people like it.

This, of course, means that many of the systems that humans use on a day-to-day basis are totally inaccessible to text agents. There are workarounds, like APIs and other integrations that enable text agents to interact with these applications, but for a variety of reasons, most websites and pieces of software don’t have them—yet.

Enter the second major type: computer use agents (CUAs), aka GUI agents. Unlike their textual cousins, these agents can actually see the world of computer windows and web pages, perceiving the relationship between different elements and allowing them—theoretically—to accomplish all the same tasks on a computer that a human could, and without the need for specialized integrations.
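The CUA loop differs from the text-agent loop only in what goes in and out: pixels in, mouse-and-keyboard actions out. A minimal sketch, with a stubbed screen and a stand-in vision model (all names and screen states here are hypothetical, not a real CUA implementation):

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def fake_vision_model(screenshot: bytes, goal: str) -> Action:
    """Stand-in for a vision-language model mapping pixels to an
    action. A real CUA would send the screenshot to a multimodal LLM."""
    if b"search_box_empty" in screenshot:
        return Action("click", x=400, y=120)
    if b"search_box_focused" in screenshot:
        return Action("type", text=goal)
    return Action("done")

def run_cua(goal: str, max_steps: int = 10) -> list[Action]:
    # Simulated screen state; a real agent would capture actual pixels.
    screen = b"search_box_empty"
    trace = []
    for _ in range(max_steps):
        action = fake_vision_model(screen, goal)
        trace.append(action)
        if action.kind == "done":
            break
        # Apply the action to the (simulated) GUI.
        screen = b"search_box_focused" if action.kind == "click" else b"results_page"
    return trace
```

The key difference from the text agent: no API or integration is assumed anywhere—the agent only needs what a human needs, a view of the screen and a way to click and type.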

No wonder, then, that major companies like OpenAI, Anthropic, Microsoft, and Google are all putting significant effort into developing their own versions. Browser agents, a subtype of CUAs that operate only through web browsers, are also attracting a great deal of attention.

So why haven’t CUAs taken over yet, if they’re so much more versatile? The main factor, up to this point, has simply been performance—LLMs’ visual capabilities lag behind their textual ones, and model providers haven’t yet figured out how to make them “think” coherently over sequences of visual inputs the way they can over text.

Text agents are also getting a boost. As software companies and web developers wake up to the critical need to make their products interface with agentic AI, they’ve started to build in integrations that allow them to connect with text agents—essential given that these are the vast majority of agents in existence today.

Thus, CUA builders are working to make agents that can access the human world, while software developers are working to make the human world accessible to (text) agents.

Where does this end? In the long run, I think CUAs will win out. No matter how much we change our software to make it usable for text agents, at the end of the day, there are too many tasks where vision is essential. In the meantime, though, text agents are much cheaper and more reliable—as long as they have the ability to interact with the software they’re using.

As I wrote in last week’s essay, integrations for text agents are now a critical battleground—but the rise of CUAs adds a caveat: critical for now. In a future where CUAs get good enough, these integrations won’t matter.

Anyone but a text agent can see that.

🎉 T-MINUS 7 DAYS TO LAUNCH

Course & Community coming Monday, 6/23

More than 100 of you have joined the waitlist 🙌

Why Join the Waitlist? 🤔

  • 🎁 First 20 to join get FREE access for life (all spots have been filled)

  • 🎁 Waitlist founding members lock in $25 a month for life (only pay when you join, cancel anytime)

  • 🚨 Price increases to $50 a month at launch

What You Get Access to 🙂

  • Hands-on course starting with no-code builds using n8n — our favorite framework for non-coders and coders

  • Weekly sessions (live + recorded) so you can ask questions at your own pace

  • Guest speakers to talk about what others are building & the future of agents

  • Monthly build competitions with cash prizes 🧠💰

  • New content & agent templates every month

  • Private community to get help, swap wins, and share what you’re building

Secure your place on the waitlist and learn to build AI agents with us!
