Anthropic’s new AI model can control your PC

In a pitch to investors last spring, Anthropic said it intends to develop AI to power virtual assistants that can perform searches, respond to emails, and handle other back-office tasks on their own. The company called this a “next-generation algorithm for AI self-teaching,” one that it says could, if all goes according to plan, one day automate large portions of the economy.

It’s taken a while, but this AI is starting to arrive.

Anthropic on Tuesday released an upgraded version of its Claude 3.5 Sonnet model that can understand and interact with any desktop application. Via a new “Computer Use” API, now in open beta, the model can mimic keystrokes, button clicks, and mouse gestures, essentially emulating a person sitting in front of a PC.

“We trained Claude to see what’s happening on a screen and then use the available software tools to complete tasks,” Anthropic wrote in a blog post shared with TechCrunch. “When a developer instructs Claude to use a piece of computer software and gives it the necessary access, Claude looks at screenshots of what’s visible to the user, then counts how many pixels, vertically or horizontally, it needs to move the cursor in order to click in the right place.”

Developers can try Computer Use through Anthropic’s API, Amazon Bedrock, and Google Cloud’s Vertex AI platform. The new Sonnet 3.5 without Computer Use is rolling out to the Claude apps and brings various performance improvements over the outgoing Sonnet 3.5 model.
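
For a concrete sense of what a Computer Use call looks like, here is a minimal sketch in Python against Anthropic’s API. It follows the beta conventions Anthropic documented at launch (the computer_20241022 tool type and the computer-use-2024-10-22 beta flag); treat the exact identifiers as subject to change.

```python
# Minimal sketch: asking the upgraded Claude 3.5 Sonnet to drive a desktop
# through the Computer Use beta. Tool type and beta flag follow Anthropic's
# launch-era docs; verify against the current documentation.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20241022",  # virtual mouse/keyboard/screen tool
            "name": "computer",
            "display_width_px": 1280,
            "display_height_px": 800,
        }
    ],
    messages=[
        {"role": "user", "content": "Open the browser and search for flights to Lisbon."}
    ],
    betas=["computer-use-2024-10-22"],
)

# The reply contains tool_use blocks (screenshot, mouse_move, left_click, etc.)
# that the developer's own harness must execute on a real or virtual display.
print(response.content)
```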

Application automation

A tool that can automate tasks on a PC is not a new idea. Countless companies offer such tools, from decades-old RPA vendors to newcomers like Relay, Induced AI, and Automat.

In the race to develop so-called “AI agents,” the field has only become more crowded. “AI agents” remains an ill-defined term, but it generally refers to AI that can automate software.

Some analysts say AI agents could provide an easier way for companies to monetize the billions of dollars they invest in AI. Businesses seem to agree: according to a recent Capgemini survey, 10% of organizations are already using AI agents and 82% will integrate them within three years.

This summer, Salesforce made splashy announcements about its AI agent technology, while Microsoft yesterday touted new tools for creating AI agents. OpenAI, which is developing its own brand of AI agents, sees the technology as a step toward super-intelligent AI.

Anthropic calls its AI agent concept an “action execution layer” that allows the new Sonnet 3.5 to execute commands at the desktop level. With its ability to browse the web (not a first for AI models, but a first for Anthropic), 3.5 Sonnet can use any website and any application.

Anthropic’s new AI can control apps on a PC. Image credits: Anthropic

“Humans maintain control by providing specific prompts that direct Claude’s actions, such as ‘use my computer and online data to fill out this form,'” an Anthropic spokesperson told TechCrunch. “People grant access and restrict access as needed. Claude breaks down users’ prompts into computer commands (e.g., move cursor, click, type text) to accomplish that specific task.”
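
The spokesperson’s description maps onto a simple loop in the developer’s harness: the model emits tool_use actions, the harness executes them on a real or virtual display, and screenshots flow back as tool results. A rough sketch of that loop follows; execute_action is a hypothetical stand-in for the developer’s own input-and-screen driver, not part of Anthropic’s SDK.

```python
# Schematic agent loop for Computer Use: Claude proposes desktop-level
# commands; the harness performs them and returns screenshots as tool
# results until the model stops requesting actions.
def run_agent_loop(client, messages, tools, execute_action):
    while True:
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=tools,
            messages=messages,
            betas=["computer-use-2024-10-22"],
        )
        messages.append({"role": "assistant", "content": response.content})

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                # block.input looks like {"action": "left_click"} or
                # {"action": "type", "text": "..."}; execute_action (the
                # developer's hypothetical driver) performs it and, for
                # "screenshot", returns a base64-encoded image payload.
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": execute_action(block.input),
                })

        if not tool_results:  # no more actions requested: done (or stuck)
            return messages
        messages.append({"role": "user", "content": tool_results})
```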

Software development platform Replit used an early version of the new Sonnet 3.5 model to create an “autonomous checker” that can evaluate applications as they are built. Canva, meanwhile, says it’s exploring ways the new model could support the design and editing process.

But how is it different from other AI agents? It’s a reasonable question. Consumer gadget startup Rabbit is building a web agent that can do things like buy movie tickets online; Adept, recently acquired by Amazon, trains models to browse websites and navigate software; and Twin Labs uses commercially available models, including OpenAI’s GPT-4o, to automate office processes.

Anthropic claims that the new Sonnet 3.5 is simply a stronger, more robust model that can do better on coding tasks than even OpenAI’s flagship o1, according to the SWE-bench Verified benchmark. Although it has not been explicitly trained to do so, the enhanced Sonnet 3.5 automatically corrects itself and retries tasks when it encounters obstacles, and can achieve goals that require dozens or hundreds of steps.

Performance of the new Claude 3.5 Sonnet model on different benchmarks. Image credits: Anthropic

But don’t fire your secretary just yet.

In an evaluation designed to test an AI agent’s ability to assist with airline reservation tasks, such as changing a flight reservation, the new Sonnet 3.5 was able to successfully complete less than half the tasks. In a separate test involving tasks such as initiating a return, 3.5 Sonnet failed about a third of the time.

Anthropic admits that the upgraded Sonnet 3.5 struggles with basic actions like scrolling and zooming, and that it can miss “short-lived” actions and notifications because of the way it takes screenshots and stitches them together.

“Claude’s computer use remains slow and often error-prone,” Anthropic wrote in its post. “We encourage developers to begin exploring with low-risk tasks.”

Risky business

But is the new Sonnet 3.5 good enough to be dangerous? Maybe.

A recent study found that models without the ability to use desktop applications, like OpenAI’s GPT-4o, were willing to engage in harmful “multi-step agent behavior,” such as ordering a fake passport from someone on the dark web, when “attacked” using jailbreaking techniques. Jailbreaks led to high success rates in performing harmful tasks even for models protected by filters and safeguards, according to the researchers.

We can imagine how a model with desktop access could wreak far more havoc, say, by exploiting application vulnerabilities to compromise personal information (or by storing chats in plain text). Beyond the software levers at its disposal, the model’s online and app connections could open avenues for malicious jailbreakers.

Anthropic does not deny that releasing the new Sonnet 3.5 carries risk. But the company argues that the benefits of observing how the model is used in the wild ultimately outweigh that risk.

“We believe it is far better to give access to computers to today’s more limited, relatively safer models,” the company wrote. “This means we can begin to observe and learn from any potential problems that arise at this lower level, expanding computer use and safety measures gradually and in tandem.”

Image credits: Anthropic

Anthropic also says it has taken steps to deter misuse, such as not training the new Sonnet 3.5 on users’ screenshots and prompts, and preventing the model from accessing the web during training. The company says it has developed classifiers to steer 3.5 Sonnet away from actions perceived as high-risk, such as posting on social media, creating accounts, and interacting with government websites.
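
Anthropic has not published those classifiers, so their mechanics are opaque, but the idea of gating high-risk actions is easy to illustrate on the developer's side of the harness. The sketch below is purely hypothetical and uses a crude keyword screen where Anthropic presumably uses trained models.

```python
# Illustrative only: a harness-side gate that screens proposed actions
# against risky categories before executing them. A real classifier would
# be a trained model, not a keyword list.
HIGH_RISK_HINTS = ("log in", "sign up", "create account", ".gov", "post", "tweet")

def looks_high_risk(action: dict) -> bool:
    """Flag actions whose text payload matches a risky hint."""
    text = str(action.get("text", "")).lower()
    return any(hint in text for hint in HIGH_RISK_HINTS)

def gated_execute(action: dict, execute):
    """Run `execute` (the harness's input driver) unless the action is flagged."""
    if looks_high_risk(action):
        raise PermissionError(f"blocked high-risk action: {action}")
    return execute(action)
```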

As the US general election approaches, Anthropic says it is focused on mitigating election-related abuse of its models. The US AI Safety Institute and the UK AI Safety Institute, two separate but allied government agencies dedicated to assessing the risks of AI models, tested the new Sonnet 3.5 before its deployment.

Anthropic told TechCrunch that it has the ability to restrict access to additional websites and features “if necessary,” to protect against spam, fraud and misinformation, for example. As a security measure, the company keeps all screenshots captured by Computer Use for at least 30 days – a retention period that might alarm some developers.

We asked Anthropic under what circumstances, if any, it would release screenshots to a third party (e.g. law enforcement) if requested, and we will update this post if we receive an answer.

“There is no foolproof method, and we will continually evaluate and iterate our security measures to balance Claude’s capabilities with responsible use,” Anthropic said. “Those who use the computer version of Claude should take necessary precautions to minimize these types of risks, including isolating Claude from particularly sensitive data on their computer.”

Hopefully this will be enough to avoid the worst.

A cheaper model

Today’s headliner may have been the upgraded 3.5 Sonnet model, but Anthropic also said that an updated version of Haiku, the cheapest and most efficient model in its Claude series, is on the way.

Claude 3.5 Haiku, expected in the coming weeks, will match the performance of Claude 3 Opus, once Anthropic’s cutting-edge model, on some benchmarks at the same cost and “approximate speed” of Claude 3 Haiku.

“With fast speeds, improved instruction following, and more accurate tool use, Claude 3.5 Haiku is well suited for user-facing products, specialized sub-agent tasks, and generating personalized experiences from huge volumes of data such as purchase history, pricing, or inventory data,” Anthropic wrote in a blog post.

3.5 Haiku will initially be available as a text-only model, then as part of a multimodal package capable of analyzing both text and images.

3.5 Haiku benchmark performance. Image credits: Anthropic

So once 3.5 Haiku is available, will there be much reason to use 3 Opus? And what about 3.5 Opus, the successor to 3 Opus, which Anthropic teased back in June?

“All models in the Claude 3 family have their individual uses for customers,” the Anthropic spokesperson said. “Claude 3.5 Opus is on our roadmap and we will be sure to share more as soon as possible.”