Source: Xinzhiyuan
Claude 3.5 received a major upgrade late at night!
As expected, Anthropic AI finally made a big move this week - the first release of Claude 3.5 Haiku, and the new upgraded version of Claude 3.5 Sonnet is also here.
However, the "Super Large Cup" Opus still did not appear.
What is amazing is that the evolved Claude 3.5 Sonnet defeated OpenAI o1 in one fell swoop, making it the strongest inference model.
It has been significantly improved in all aspects, especially its industry-leading coding capabilities.
The Claude 3.5 Haiku has the same performance as the previous generation’s most powerful Claude 3 Opus, but the cost , the speed is similar to the previous generation Haiku.
Even now, Claude can operate the computer like a human. Not only can he view the screen and move the cursor, but he can also press buttons and type text!
The director of developer relations at Anthropic said that "computer use" is the first step in a new human-computer interaction paradigm. At the same time, it is also a new basic capability that AI models should have.
Many startups building browser intelligence became obsolete overnight.
Netizens lamented: Agents and workflows are about to change...
Can you use computer AI yourself?
In public beta, Anthropic introduces a groundbreaking new feature: computer usage capabilities. Starting today, developers can use APIs to guide Claude to use computers like a human.
The Claude 3.5 Sonnet is the first model to offer this feature in public beta.
Of course, this feature is still in the experimental stage and is a bit clumsy to use and may go wrong. Anthropic chose to release this feature in advance in order to obtain feedback from developers and improve it quickly.
Why should AI be trained to operate computers?
Anthropic said that in the past few years, the development of powerful AI has reached many milestones, such as the ability to perform complex logical reasoning and recognize and understand images.
The next breakthrough point is AI operating computers! It would be a sign of the future if models did not have to interact with custom-made tools, but were directed to use all software.
Basic computer operations
In this demo, Anthropic researcher proposed an extremely difficult challenge to Claude:
< p>My friend is coming to San Francisco and I want to watch the sunrise on the Golden Gate Bridge with him tomorrow morning. We will depart from Pacific Highlands. Can you help us find a great viewing spot, check drive times and sunrise times, and schedule a calendar event that will allow us enough time to get there?
Claude opened Google on his own and started searching.
How far is the Golden Gate Bridge from the user’s residence? Claude will open the map himself to find the distance.
After understanding the required information, it opens the calendar and arranges it for the owner schedule.
Automatic coding to write websites
The developer showed how Claude controlled his laptop and completed a website programming task smoothly.
First, Claude navigated to Claude.ai in his brother’s Chrome browser, and asked Claude to create a 90s-themed personal homepage for himself.
I saw it entering the URL, typing prompts, and making a request to another Claude.
Claude.ai returned some codes, and the rendered picture looks very good , but I hope to make some modifications to the website locally on my computer.
So he asked Claude to download the file and then open it in VS Code. Claude successfully completed these instructions.
Then the little brother asked Claude to start a server, and then he could actually view the file in the browser.
Claude opened the VS Code terminal and tried to start a server, but then encountered an error: Python was not installed on the machine.
As a result, by looking at the terminal output, Claude discovered the problem himself! It tried again with Python 3 and successfully started the server.
However, there is an error in the terminal output, and a file icon is missing at the top . The developer asked Claude to identify this error and fix it in the file.
Surprisingly, Claude found the line that caused the error in VS Code, deleted the entire line, then saved the file and re-run the website.
This time, the website is completely correct!
Automatically search for data and fill in the form
Suppose we need to fill in a supplier request form from "Ant Equipment Company", but the data that needs to be filled in is scattered In every corner of the computer, can Claude help us complete it?
It started to take screenshots of my brother’s screenshots, and soon discovered that Ant Equipment Company was not in the table.
At this time, it immediately switches to the CRM system to search for this company. Once it found it, it started scrolling, looking for all the information it needed to fill out the form, and then submitted the form.
This means that many of the tedious things we have to do at work can be left to Claude!
Now, this function is available in the API.
Now, many well-known companies, such as Asana, Canva, Cognition, DoorDash, Replit and The Browser Company, are already exploring Claude’s new potential, allowing them to perform complex tasks with dozens or even hundreds of steps. .
For example, Replit is leveraging Claude 3.5 Sonnet's computer usage and user interface navigation capabilities to develop functionality for the Replit Agent, which can be evaluated in real time as the application is being built.
Far lower than humans, but promising in the future
What is the computer usage capability of the newly upgraded Claude 3.5 Sonnet?
In the OSWorld test, it scored 14.9% in the task category based solely on screenshots, significantly surpassing the second-ranked AI system (7.8%).
When more steps were allowed to complete the task, Claude's score improved to 22.0%.
This shows that multiple interactions between the model and the environment can optimize task performance.
Although this result is significantly improved than before, it is still far lower than the human performance of 72.36%.
This also implies that Claude 3.5 Sonnet still has a lot of room for improvement in the future.
After all, some operations that humans complete effortlessly (scrolling, dragging, Zoom), which is currently extremely challenging for Claude.
Upgraded version of Claude 3.5 Sonnet, the coding king defeated o1
In various industry benchmark tests, the performance of the upgraded version of Claude 3.5 Sonnet has been improved in all aspects.
In particular, significant breakthroughs have been made in agent coding and tool usage tasks.
Paper address: https:/// /assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf
In terms of coding capabilities, it performed from 33.4 in the SWE-bench Verified test % increased significantly to 49.0%.
This exceeds all publicly available models - including inference models such as OpenAI o1-preview and specialized systems designed for agent coding.
In addition, in TAU-bench (an assessment of the agent's ability to use tools Benchmark test), the Claude 3.5 Sonnet also performed well:
The score in the retail area increased from 62.6% to 69.2%, and in the more recent The challenging aviation sector jumped from 36.0% to 46.0%.
As can be seen from the table below On the inference test benchmark GPQA (Diamond), the new version of Claude 3.5 Sonnet significantly surpasses GPT-4o.
In visual QA, mathematical reasoning, document visual Q&A, chart Q&A, scientific tables In the benchmark test, the performance of Claude 3.5 Sonnet has become a new benchmark in the industry.
It is worth mentioning that while the new version of Claude 3.5 Sonnet has a performance breakthrough, it still maintains At the same price and running speed as the previous model.
Feedback from some early test users further confirms that the upgraded Claude 3.5 Sonnet has achieved a "qualitative" leap in the field of AI-driven coding.
GitLab:In the DevSecOps task test, it was found that Claude 3.5 Sonnet can perform inference without increasing latency. Significantly improved capabilities (up to 10% improvement for each use case), making it ideal for driving complex software development processes
Cognition: The new version of Claude 3.5 Sonnet is used in autonomous AI evaluation, and has made substantial progress compared with previous models in terms of coding, planning, and problem solving
The Browser Company: When using this model to automate network workflows, it was found that Claude 3.5 Sonnet outperformed all models they had previously tested
In addition, before secure deployment , Claude 3.5 Sonnet has been jointly tested by the United States AI Security Institute (US AISI) and the British Security Institute (UK AISI).
Moreover, after its own evaluation, the ASL-2 standard formulated by Anthorpic in the "Responsible Scaling Policy" is still applicable to the new model.
As mentioned before, the upgraded version of Claude 3.5 Sonnet can now be used on web pages and terminal APPs.
API pricing starts at$3 per million input tokens , $15 per million output tokens.
Save up to 90% of costs by using smart caching technology, and 50% by using batch APIs.
Application Scenarios
Claude 3.5 Sonnet can understand subtle instructions and context, identify and correct its own errors, and generate in-depth analysis from complex data and insights. Combining state-of-the-art coding, visual recognition and writing capabilities, Claude 3.5 Sonnet can be used in a variety of scenarios.
- Simulate human operation of the computer
By integrating Claude through the API, developers can guide Claude to use the computer like a human - by observing the screen, moving the mouse, Click buttons and type text. Claude 3.5 Sonnet is the first cutting-edge AI model that can reliably use computers in this way, and while it is still experimental in public beta, its capabilities will continue to improve over time.
- Automatic code generation
Claude 3.5 Sonnet can assist with the entire software development life cycle - from initial design to bug fixing, from system maintenance to performance optimization . It can be integrated directly into products or used as an intelligent coding assistant through the Claude.ai platform.
- Intelligent dialogue system
With enhanced reasoning capabilities and friendly, natural tone, Claude 3.5 Sonnet is ideal for developers who need to connect data and execute across systems Intelligent dialogue system for operation.
- Intelligent knowledge question and answer
Claude 3.5 Sonnet has large-scale context processing capabilities and extremely low hallucination rate, making it ideal for processing large knowledge bases and documents Ideal for Q&A tasks on and code bases.
- Visual information extraction
Claude 3.5 Sonnet can easily extract information from visual materials such as charts, graphs and complex schematics - making it data Ideal AI model for analytics and data science tasks.
- Process Automation
Claude 3.5 Sonnet enables the automation of repetitive tasks or processes. It has industry-leading command execution capabilities and can handle complex processes and operations.
The new Claude 3.5 Haiku is smarter than its older brother
Judging from the benchmark of the previous generation, the Claude 3.5 Haiku can be called the “smallest cup”.
This is Anthropic's fastest model.
It not only maintains the same running cost and similar processing speed as Claude 3 Haiku, but also comprehensively improves various skills.
Even, in several intelligent benchmark tests, Claude 3.5 Haikusurpassed the most powerful model of the previous generation, Claude 3 Opus.
Similarly, Claude 3.5 Haiku performed particularly well on coding tasks.
For example, in the SWE-bench Verified test, it achieved a high score of 40.6%, surpassing many AI agents using publicly available state-of-the-art models - including Original versions ofClaude 3.5 Sonnet and GPT-4o.
Claude 3.5 Haiku has three outstanding advantages:
1. Low latency response
2. More accurate command execution capability
3. More accurate Tool usage
These features make the model particularly suitable for user-oriented product development, specialized sub-agent task processing, and generation based on massive data (such as purchase records, price information, or inventory data) Personalized experience.
At the end of this month, Claude 3.5 Haiku will be available on multiple platforms, including Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. (Initially it will be launched as a plain text model, and image input functionality will be added later)
Claude 3.5 Haiku’s pricing starts at $0.25 per million input tokens, per One million output tokens are $1.25.
You can save up to 90% of the cost by using prompt word caching technology, and 50% of the cost by using the message batching API.
Application Scenarios
With fast processing speed, improved command execution capabilities and more accurate tool usage, Claude 3.5 Haiku is very suitable for user-oriented products , specialized auxiliary tasks, and generating personalized experiences from massive amounts of data.
- Code automatic completion
Claude 3.5 Haiku can provide fast and accurate code suggestions and completion, effectively accelerating the development workflow. Especially suitable for software development teams who want to simplify the coding process and increase productivity.
- Intelligent Chatbot
With enhanced conversational capabilities and fast response times, Claude 3.5 Haiku drives responsive chat that can handle high volumes of user interaction Excellent performance in robotics. It is especially valuable for customer service, e-commerce and education platforms that require scalable interactive capabilities.
- Data extraction and automatic annotation
Claude 3.5 Haiku can efficiently process and classify information, and performs well in fast data extraction and automatic annotation tasks. This capability is particularly useful for organizations that need to process large amounts of unstructured data in finance, healthcare, and research.
- Automated real-time content moderation
Claude 3.5 Haiku provides reliable, instant content moderation services through its improved reasoning and content understanding capabilities. This is extremely valuable for social platforms, online communities and media organizations that need to maintain safe, appropriate content at scale.
How to teach Claude to operate a computer
Anthropic said that operations that humans can easily perform - scrolling, dragging, and zooming are still very challenging for Claude.
For risks such as spam, false information, and fraud, companies are looking for safe deployment strategies, such as developing identification systems to detect whether harm has occurred.
Research process
Anthropic’s work on tool use and multi-modality has laid the foundation for AI recognition and interpretation of images.
On this basis, Claude also needs to reason about how and when to perform actions based on the screen content.
To do this, the researchers trained Claude to accurately count pixels to complete commands, as it had to figure out how many pixels it needed to move the mouse pointer vertically or horizontally to click in the correct location.
During this period, Claude quickly successfully transferred his learning from training on simple software such as calculators and text editors to other applications (note that the Internet was not allowed during this period).
This kind of training allows it to convert user instructions into a series of logical steps to perform operations. It can even self-correct and retry the task when it encounters obstacles.
Interlude
Alex Albert, director of developer relations at Anthropic, also shared an interesting story about the team’s development of computer usage features.
At that time, they held a bug bash (bug bash) for engineers to ensure that all potential problems with the API were discovered.
This means locking a group of engineers in a room for several hours.
At that time, everyone happened to be hungry. One of the engineers had an idea, "How about letting Claude do a practical exercise and open DoorDash to order food for us."
Unexpectedly, about a minute later, Claude ordered pizza for the engineers .
Looking to the future
The ability of AI to operate computers represents a new approach to artificial intelligence development.
To date, LLM developers have been working hard to adapt the tools to the model, creating special environments that allow AI to use specially designed tools to complete various tasks.
Now, Anthropic "does the opposite" - they choose to let the model adapt to the tool. In other words, Claude can integrate into the computer environment we use every day and use existing software directly, just like humans.
Although Claude has reached the current top level, its operation is still relatively slow and error-prone. Claude is not yet able to perform many of the operations we use daily on computers, such as dragging and zooming.
In addition, the way Claude currently observes the screen is similar to quickly flipping through a "picture album" - by taking consecutive screenshots and stitching them together, rather than observing a continuous video stream. This means it may miss some brief actions or notifications.
Interestingly, Anthropic also encountered some interesting episodes when recording the demo.
For example, during a demonstration, Claude accidentally clicked to stop a long-running screen recording, causing all the recordings to be wasted.
During another coding demonstration, Claude suddenly "distracted" and began browsing photos of Yellowstone National Park with great interest.
In short, Claude’s current performance makes people full of expectations for the future: AI operates computers Its capabilities will improve rapidly, and one day, software development novices can easily use it.