Translated by Jiayu, Cage
AI Agents are a paradigm shift that we track closely, and a series of articles by LangChain is very helpful for understanding how Agents are developing. The first part of this compilation is the State of AI Agents report released by the LangChain team. They surveyed more than 1,300 practitioners, including developers, product managers, and company executives, and the report reveals the current state of Agents this year and the bottlenecks to putting them into production: nine out of ten companies have plans and needs for AI Agents, but the limits of Agent capabilities mean that users can only deploy them in a handful of processes and scenarios. Compared with cost and latency, people care more about improving Agent capabilities and about the observability and controllability of Agent behavior.
In the second part, we compiled the analysis of the key elements of AI Agents from the In the Loop series on the LangChain blog: planning ability, UI/UX interaction innovation, and memory mechanisms. The articles analyze the interaction patterns of five LLM-native products and compare three types of human memory mechanisms, which is instructive for understanding AI Agents and these key elements. In this part, we also added some representative Agent company case studies, such as an interview with the founder of Reflection AI, to look ahead to the key breakthroughs of AI Agents in 2025.
Under this analytical framework, we expect AI Agent applications to begin to emerge in 2025 and usher in a new paradigm of human-machine collaboration. On planning ability, models such as o3 are showing strong reflection and reasoning capabilities, and model companies are progressing from the reasoner stage toward the Agent stage. As reasoning capabilities continue to improve, the "last mile" for Agents will be product interaction and memory mechanisms, which may be where startups break through. On interaction, we have been looking forward to the "GUI moment" of the AI era; on memory, we believe Context will become the keyword for Agent adoption: personalized context at the individual level and unified context at the enterprise level will greatly improve the product experience of Agents.
01. Agent usage trends:
Every company is planning to deploy Agents
Competition in the Agent field is becoming fierce. In the past year, many Agent frameworks have become popular: for example, using ReAct combined with LLM for reasoning and action, using multi-agent frameworks for orchestration, or using more controllable frameworks like LangGraph.
Discussions about Agents are not all hype on Twitter. About 51% of respondents are currently using agents in production. According to LangChain's data broken down by company size, mid-sized companies with 100-2,000 employees are the most active in putting agents into production, at 63%.
In addition, 78% of respondents have plans to adopt agents into production in the near future. It is clear that there is a strong interest in AI agents, but actually making a production-ready agent is still a challenge for many.
Although the technology industry is often considered an early adopter of agents, interest in agents is growing across all industries. Among respondents working in non-tech companies, 90% have or plan to put agents into production (almost the same as the proportion of technology companies, 89%).
Common use cases of agents
The most common use cases of agents include conducting research and summarizing (58%), followed by streamlining workflows through customized agents (53.5%).
These reflect people's desire for products to take over time-consuming tasks. Instead of sifting through massive amounts of data and then reviewing, researching, and analyzing it themselves, users can rely on AI agents to extract key information and insights. Similarly, AI agents can improve personal productivity by assisting with daily tasks, allowing users to focus on what matters.
Not only individuals need this efficiency boost; companies and teams do too. Customer service (45.8%) is another major application area for agents, where agents help companies handle inquiries, troubleshoot, and speed up customer response times across teams; lower-level code and data applications take fourth and fifth place.
Monitoring: Agent applications require observability and controllability
As agent implementations become more powerful, methods for managing and monitoring agents are needed. Tracing and observability tools top the list of must-haves, helping developers understand agent behavior and performance. Many companies also use guardrails to prevent agents from going off track.
When testing LLM applications, offline evaluation (39.8%) is more commonly used than online evaluation (32.5%), reflecting the difficulty of monitoring LLMs in real time. In the open-ended responses collected by LangChain, many companies also have human experts manually review or evaluate responses as an additional layer of protection.
Despite the enthusiasm for agents, people are generally conservative about agent permissions. Few respondents allow their agents to read, write, and delete freely. Instead, most teams grant tools read-only permissions, or require human approval before an agent can take riskier actions such as writes or deletes.
Companies of different sizes also have different priorities when it comes to agent control. Unsurprisingly, large enterprises (2,000+ employees) are more cautious and rely heavily on read-only permissions to avoid unnecessary risks. They also tend to combine guardrails with offline evaluation, preferring to catch problems before customers ever see them.
Meanwhile, small companies and startups (fewer than 100 employees) focus more on tracing to understand what is happening inside their Agent applications, rather than on other controls. According to LangChain's survey data, smaller companies tend to focus on understanding results by looking at the data, while large enterprises put more controls in place across the board.
Obstacles and Challenges of Putting Agents into Production
Ensuring high-quality LLM performance is hard: answers need to be highly accurate and delivered in the right style. This is the top concern for Agent developers, more than twice as important as other factors such as cost and security.
LLM agents produce probabilistic outputs, which makes them highly unpredictable. This introduces more room for error, making it hard for teams to ensure that their agents consistently deliver accurate, context-appropriate responses.
This is especially true for small companies, where performance quality far outweighs other considerations: 45.8% cite it as a primary concern, while cost (the second-largest concern) is cited by only 22.4%. This gap underscores how important reliable, high-quality performance is for moving agents from development into production.
Security concerns are also prominent for large companies, which must comply strictly with regulations and handle customer data with care.
The challenge is not only quality. In the open-ended answers collected by LangChain, many people remain skeptical about whether their companies should keep investing in developing and testing agents. They mentioned two prominent obstacles: building agents requires deep expertise and keeping up with the technological frontier; and developing and deploying agents takes a lot of time and money, while the payoff of reliable operation remains uncertain.
Other Emerging Topics
In the open-ended responses, there was much praise for the capabilities AI agents already demonstrate:
•Managing multi-step tasks: AI agents are able to perform deeper reasoning and context management, enabling them to handle more complex tasks;
•Automating repetitive tasks: AI agents continue to be seen as key to automating tasks, which can free up users’ time to solve more creative problems;
•Task planning and collaboration: Better task planning ensures that the right agent is working on the right problem at the right time, especially in multi-agent systems;
•Human-like reasoning: Unlike traditional LLMs, AI agents can trace their decisions back, including looking back and modifying past decisions based on new information.
In addition, there are two most anticipated developments:
• Expectations for open source AI agents: People are obviously interested in open source AI agents, and many people mentioned that collective intelligence can accelerate the innovation of agents;
• Expectations for more powerful models: Many people are looking forward to the next leap of AI agents driven by larger and more powerful models—at that time, agents can handle more complex tasks with higher efficiency and autonomy.
Many respondents also mentioned the biggest challenge in agent development: understanding the behavior of agents. Some engineers said they have trouble explaining the capabilities and behavior of AI agents to company stakeholders. Visualization plug-ins sometimes help, but in most cases the LLM remains a black box, and the extra burden of explainability falls on the engineering team.
02. Core Elements in AI Agents
What is an Agentic System
Before the State of AI Agents report was released, the LangChain team had built its own LangGraph framework in the Agent space and discussed many of the key components of AI Agents in its In the Loop blog series. The following is our compilation of the key content.
First of all, everyone has a slightly different definition of AI Agent. LangChain founder Harrison Chase gave the following definition:
An AI agent is a system that uses an LLM to decide the control flow of an application.
For its implementation, the article introduces the concept of a cognitive architecture, which describes how the Agent thinks and how the system orchestrates code and LLM prompts:
• Cognitive: the Agent uses an LLM to semantically reason about how to orchestrate code and prompt the LLM;
• Architecture: These Agent systems still involve a lot of engineering similar to traditional system architecture.
The picture below shows examples of cognitive architecture at different levels:
• Standardized software code: everything is hard-coded, with output and input parameters fixed directly in the source code. This does not constitute a cognitive architecture because there is no cognitive part;
• LLM Call: apart from some data preprocessing, a single LLM call makes up most of the application. Simple chatbots belong to this category;
• Chain: a series of LLM calls. A chain breaks the problem into several steps and calls different LLMs to solve them. Complex RAG pipelines belong to this category: one LLM call is used to search and form the query, and a second is used to generate the answer;
• Router: in the previous three systems, the user knows in advance all the steps the program will take; with a router, the LLM decides for itself which LLM to call and what steps to take, which adds more randomness and unpredictability;
• State Machine: combining an LLM with a router is even more unpredictable, because the combination is placed in a loop and the system can (in theory) make an unlimited number of LLM calls;
• Agentic system: also called an "Autonomous Agent". With a state machine, there are still restrictions on which actions can be taken and which processes run after an action; with an Autonomous Agent, those restrictions are removed. The LLM decides which steps to take and how to orchestrate the different LLMs, whether via different prompts, tools, or code.
In short, the more "Agentic" a system is, the more LLM determines how the system behaves.
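To make the distinction concrete, here is a minimal Python sketch (our illustration, not code from the article) contrasting a fixed chain, where the developer decides every step, with a router, where the LLM decides the control flow. The call_llm helper is hypothetical and stands in for any chat-model call.

```python
# A minimal sketch contrasting a fixed chain with a router where the LLM
# decides the control flow. `call_llm` is a hypothetical helper that sends a
# prompt to any chat model and returns its text reply.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider of choice")

def fixed_chain(question: str) -> str:
    # Chain: the steps are hard-coded; the LLM only fills in content.
    query = call_llm(f"Rewrite as a search query: {question}")
    docs = f"<search results for {query}>"  # stand-in for a retrieval step
    return call_llm(f"Answer '{question}' using: {docs}")

def router(question: str) -> str:
    # Router: the LLM itself picks the next step, so the control flow is no
    # longer known in advance.
    choice = call_llm(
        f"Question: {question}\n"
        "Reply with exactly one word: SEARCH, MATH, or ANSWER."
    ).strip().upper()
    if choice == "SEARCH":
        return fixed_chain(question)
    if choice == "MATH":
        return call_llm(f"Solve step by step: {question}")
    return call_llm(question)
```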
Key Elements of Agents
Planning
Agent reliability is a big pain point: companies often build agents with LLMs, but report that their agents cannot plan and reason well. What do planning and reasoning mean here?
Agent planning and reasoning refers to the ability of LLM to think about what actions to take. This involves short-term and long-term reasoning. LLM evaluates all available information and then decides: what series of steps do I need to take, and which is the first step I should take now?
Most of the time, developers use function calling to let the LLM choose an action to perform. Function calling is a capability OpenAI first added to its LLM API in June 2023: users provide JSON schemas for different functions and have the LLM match its output to one (or more) of those schemas.
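For reference, here is a minimal sketch of what function calling looks like with the OpenAI Python SDK (v1.x). The model name and the get_weather function are illustrative, not part of the original report.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Describe the available function as a JSON schema the model can match against.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Do I need an umbrella in Paris today?"}],
    tools=tools,
)

# If the model chose to act, it returns the function name and JSON arguments
# instead of (or alongside) a plain-text answer.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```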
To successfully complete a complex task, the system needs to take a series of actions in sequence. This long-term planning and reasoning is very complex for LLM: first, LLM must consider a long-term action plan and then return to the short-term actions to take; second, as the Agent performs more and more actions, the results of the actions will be fed back to LLM, causing the context window to grow, which may cause LLM to be "distracted" and perform poorly.
The easiest solution to improve planning is to ensure that LLM has all the information it needs to properly reason/plan. Although this sounds simple, often the information passed to the LLM is simply not enough for the LLM to make a reasonable decision, and adding a retrieval step or clarifying the prompt may be a simple improvement.
After that, you can consider changing the cognitive architecture of the application. There are two types of cognitive architectures that improve reasoning: general cognitive architectures and domain-specific cognitive architectures.
1. General cognitive architecture
General cognitive architectures can be applied to any task. Two such architectures were proposed in two papers. One is the "plan and solve" architecture, proposed in Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models, in which the agent first proposes a plan and then executes each step of that plan. The other is the Reflexion architecture, proposed in Reflexion: Language Agents with Verbal Reinforcement Learning, in which the agent has an explicit "reflection" step after performing a task to assess whether it did so correctly. We won't go into detail here; see the two papers above.
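As a rough illustration of the two patterns (our own sketch built on the hypothetical call_llm helper from the earlier example, not the papers' reference implementations):

```python
def plan_and_solve(task: str) -> str:
    # Plan-and-Solve: ask for a plan first, then execute each step in order.
    plan = call_llm(f"Break this task into numbered steps: {task}")
    result = ""
    for step in plan.splitlines():
        if step.strip():
            result = call_llm(
                f"Task: {task}\nProgress so far: {result}\nDo this step: {step}"
            )
    return result

def solve_with_reflection(task: str, max_rounds: int = 3) -> str:
    # Reflexion-style loop: attempt, self-critique, retry with the critique in context.
    attempt = call_llm(task)
    for _ in range(max_rounds):
        critique = call_llm(
            f"Task: {task}\nAttempt: {attempt}\nIs this correct? If not, explain why."
        )
        if critique.lower().startswith("yes"):
            break
        attempt = call_llm(
            f"Task: {task}\nPrevious attempt: {attempt}\nCritique: {critique}\nTry again."
        )
    return attempt
```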
Although these ideas show improvements, they are often too general to be actually used by agents in production. (Translator's note: There was no o1 series model when this article was published)
2. Domain-specific cognitive architecture
Instead, we see agents being built with domain-specific cognitive architectures. This often takes the form of domain-specific classification or planning steps and domain-specific verification steps. Some ideas from planning and reflection can be applied here, but they are usually applied in a domain-specific way.
A concrete example comes from the AlphaCodium paper: state-of-the-art performance was achieved by using what the authors call "flow engineering" (another way of talking about cognitive architectures).
You can see that the agent's flow is very specific to the problem they are trying to solve. They tell the agent what to do in steps: come up with a test, then come up with a solution, then iterate on more tests, etc. This cognitive architecture is highly domain-specific and does not generalize to other domains.
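A simplified sketch of this kind of domain-specific, test-first flow (our illustration of the pattern described above, not AlphaCodium's actual pipeline; run_tests is a hypothetical sandboxed test runner):

```python
def coding_agent(problem: str, max_iters: int = 5) -> str:
    # Domain-specific flow: generate tests first, then iterate on a solution
    # until the (hypothetical) run_tests helper reports success.
    tests = call_llm(f"Write unit tests for this problem:\n{problem}")
    solution = call_llm(f"Problem:\n{problem}\nWrite code that passes:\n{tests}")
    for _ in range(max_iters):
        report = run_tests(solution, tests)  # hypothetical sandboxed test runner
        if report.all_passed:
            break
        solution = call_llm(
            f"Problem:\n{problem}\nCode:\n{solution}\n"
            f"Failing tests:\n{report.failures}\nFix the code."
        )
    return solution
```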
Case Study:
Reflection AI founder Laskin's vision of the future of Agents
In Sequoia Capital's interview with Reflection AI founder Misha Laskin, Misha said he is setting out to realize his vision: building the best Agent model at his new company, Reflection AI, by combining RL's search capability with LLMs. He and co-founder Ioannis Antonoglou (who led AlphaGo, AlphaZero, and Gemini RLHF) are training models designed for agentic workflows. The main points of the interview are as follows:
• Depth is the missing piece in AI Agents. While current language models excel in breadth, they lack the depth required to reliably complete tasks. Laskin believes that solving the "depth problem" is critical to creating truly capable AI agents, where capability refers to: agents that can plan and execute complex tasks in multiple steps;
• Combining Learn and Search is the key to achieving superhuman performance. Drawing on the success of AlphaGo, Laskin emphasized that the most profound idea in AI is the combination of Learn (relying on LLM) and Search (finding the optimal path). This approach is critical to creating agents that can outperform humans in complex tasks;
• Post-training and reward modeling pose significant challenges. Unlike games with explicit rewards, real-world tasks often lack ground-truth rewards. Developing a reliable reward model is a key challenge in creating reliable AI agents;
• Universal Agents may be closer than we think. Laskin estimates that we may be only three years away from "digital AGI," an AI system that has both breadth and depth. This accelerated timeline highlights the urgency of addressing safety and reliability issues as capabilities develop;
• The path to Universal Agents requires a unique approach. Reflection AI focuses on expanding agent capabilities, starting with a few specific environments, such as browsers, coding, and computer operating systems. Their goal is to develop Universal Agents that are not limited to specific tasks.
UI/UX Interaction
In the next few years, human-computer interaction will become a key area of research: Agent systems are different from traditional computer systems in the past because of new challenges brought about by latency, unreliability, and natural language interfaces. Therefore, new UI/UX paradigms for interacting with these Agent applications will emerge. Agent systems are still in their early stages, but multiple emerging UX paradigms have emerged. Let's discuss each one separately:
1. Conversational Interaction (Chat UI)
Chats are generally divided into two types: streaming chat and non-streaming chat.
Streaming chat is the most common UX today: a chatbot that streams its thoughts and actions back in chat format, with ChatGPT as the most popular example. This interaction mode looks simple, but it works well for several reasons: first, you talk to the LLM in natural language, so there is no barrier between the user and the LLM; second, the LLM may take a while to work, and streaming lets users see exactly what is happening in the background; third, LLMs often make mistakes, and chat provides a natural interface to correct and guide them, since everyone is already used to follow-up conversations and iterative discussion in chat.
But streaming chat also has its drawbacks. First, it is a relatively new user experience, so existing chat platforms (iMessage, Facebook Messenger, Slack, etc.) do not support it; second, it is a bit awkward for long-running tasks: should users just sit there and watch the Agent work? Third, streaming chat usually needs to be triggered by a human, which means a human still has to stay in the loop.
The biggest difference with non-streaming chat is that the response is returned in full once it is ready: the LLM works in the background and users are not waiting for an immediate answer, which may make it easier to fit into existing workflows. People are used to texting other humans, so why wouldn't they adapt to texting with an AI? Non-streaming chat also makes it easier to interact with more complex agent systems: these systems often take a while, which can be frustrating if an instant response is expected. Non-streaming chat removes that expectation, making it easier to do more complex things.
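Mechanically, the difference is only whether tokens are consumed as they arrive or the answer is delivered once it is complete. A minimal sketch with the OpenAI Python SDK (v1.x; the model name is illustrative):

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Summarize the State of AI Agents survey."}]

# Streaming chat: print tokens as they arrive so the user can watch progress.
stream = client.chat.completions.create(
    model="gpt-4o-mini", messages=messages, stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

# Non-streaming chat: block until the full answer is ready, then deliver it,
# for example as a message in an existing inbox or Slack channel.
resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(resp.choices[0].message.content)
```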
These two chat methods have the following advantages and disadvantages:
2. Background Environment (Ambient UX)
Users are willing to send messages to an AI, which is the chat UX discussed above; but if the Agent is simply working in the background, how do we interact with it?
In order for the Agent system to truly realize its potential, there needs to be this shift to allow AI to work in the background. When tasks are processed in the background, users are generally more tolerant of longer completion times (because they relax their expectations for low latency). This frees up the Agent to do more work, often more carefully and diligently than in the chat UX.
In addition, running agents in the background extends the capabilities of us human users. Chat interfaces often limit us to performing one task at a time. However, if agents are running in the background, there could be many agents handling multiple tasks at the same time.
Having agents running in the background requires user trust, so how do you build that trust? One simple idea: show the user exactly what the agent is doing. Show all the steps it is performing and let the user observe what is happening. While these steps may not be shown immediately (as they are when streaming responses), they should be available for the user to click through and inspect. The next step is to let the user not only see what is happening but also correct the agent: if they see that the agent made a wrong choice at step 4 of 10, they can go back to step 4 and correct it in some way.
This approach moves the user from being “In-the-loop” to being “On-the-loop”. “On-the-loop” requires the ability to show the user all the intermediate steps performed by the Agent, allowing the user to pause the workflow midway, provide feedback, and then have the Agent continue.
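A bare-bones sketch of this "On-the-loop" pattern (our illustration, again using the hypothetical call_llm helper): every intermediate step is recorded so the user can inspect the run, rewind to an earlier step, and inject a correction before the Agent continues.

```python
from dataclasses import dataclass, field

@dataclass
class AgentRun:
    task: str
    steps: list[str] = field(default_factory=list)  # full trace shown to the user

    def advance(self, correction: str | None = None) -> str:
        # Ask the LLM for the next step given the visible history (plus any
        # human correction), and append it to the trace.
        context = "\n".join(self.steps)
        prompt = f"Task: {self.task}\nSteps so far:\n{context}\n"
        if correction:
            prompt += f"Human correction: {correction}\n"
        step = call_llm(prompt + "What is the next step?")
        self.steps.append(step)
        return step

    def rewind(self, step_index: int) -> None:
        # Let the user roll the run back to a chosen step before resuming.
        self.steps = self.steps[:step_index]
```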
One application that implements a similar UX is Devin, the AI Software Engineer. Devin takes a long time to run, but customers can see all the steps taken, rewind the development status to a specific point in time, and issue corrections from there. Although the Agent may be running in the background, this does not mean that it needs to perform tasks completely autonomously. Sometimes the Agent does not know what to do or how to answer, at which point it needs to get human attention and ask for help.
A specific example is the email assistant Agent that Harrison is building. While the email assistant can reply to basic emails, it often requires Harrison's input on certain tasks that he does not want to automate, including: reviewing complex LangChain bug reports, deciding whether to attend a meeting, etc. In this case, the email assistant needs a way to communicate to Harrison that it needs information to respond. Note that it's not asking for a direct answer; instead, it's asking Harrison for his opinion on certain tasks, which it can then use to craft and send a nice email or schedule a calendar invite.
Currently, Harrison has this assistant set up in Slack. It sends Harrison a question, and he answers it there, natively integrated into his workflow. This type of UX is similar to a customer-support dashboard: the interface shows all the areas where the assistant needs human help, the priority of each request, and any other relevant data.
3. Spreadsheet UX
The spreadsheet UX is a super intuitive and user-friendly way to support batch work. Each table, or even each column, becomes its own Agent researching something specific. This batch processing lets users scale up their interactions with many Agents at once.
This UX has other benefits as well. The spreadsheet format is a UX that most users are familiar with, so it fits well into existing workflows. This type of UX is well suited for data enrichment, a common LLM use case where each column can represent a different attribute to be enriched.
Products from companies such as Exa AI, Clay AI, and Manaflow all use this type of UX. Below, Manaflow is used as an example to show how this spreadsheet UX handles workflows.
Case Study:
How Manaflow uses spreadsheets
Manaflow was inspired by Minion AI, where founder Lawrence previously worked. Minion AI built a web Agent that can control a local Google Chrome instance and interact with applications, such as booking flights, sending emails, and scheduling car washes. Drawing on that inspiration, Manaflow chose to let Agents operate spreadsheet-like tools, because Agents are not good at handling human UI interfaces; what they are really good at is writing code. Manaflow therefore lets the Agent call Python scripts, database interfaces, and external APIs, and then operate the database directly: reading times, making bookings, sending emails, and so on.
The workflow is as follows: the main interface of Manaflow is a spreadsheet (a Manasheet), where each column represents a step in the workflow and each row corresponds to an AI Agent performing a task. The workflow of each sheet can be programmed in natural language (so non-technical users can describe tasks and steps). Each sheet has an internal dependency graph that determines the execution order of the columns; rows of Agents then execute their tasks in parallel, handling processes such as data transformation, API calls, content retrieval, and message sending:
A Manasheet is generated by entering natural language, as in the red box above. For example, to send pricing emails to customers as in the figure, you can enter a prompt in the chat to generate a Manasheet. The sheet shows each customer's name, email address, industry, whether the email has been sent, and other information; clicking Execute Manasheet runs the task.
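A rough sketch of the underlying pattern (our illustration, not Manaflow's actual implementation; lookup_email, send_email, and call_llm are hypothetical helpers): columns run in dependency order within a row, and rows run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Columns as workflow steps, listed in dependency order. Each step is a
# function that fills in one cell given the row so far.
COLUMNS = [
    ("email", lambda row: lookup_email(row["name"])),           # hypothetical helper
    ("draft", lambda row: call_llm(
        f"Write a pricing email to {row['name']} <{row['email']}>")),
    ("sent",  lambda row: send_email(row["email"], row["draft"])),  # hypothetical helper
]

def run_row(row: dict) -> dict:
    # One "row agent": execute each column step in order for this customer.
    for column, step in COLUMNS:
        row[column] = step(row)
    return row

def run_manasheet(rows: list[dict]) -> list[dict]:
    # Rows are independent, so they can be processed in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(run_row, rows))

# Example: run_manasheet([{"name": "Acme Corp"}, {"name": "Globex"}])
```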
4. Generative UI
"Generative UI" has two different implementations.
One way is for the model to generate the required raw components by itself. This is similar to products like Websim. Behind the scenes, the Agent mostly writes raw HTML, giving it full control over what is displayed. But this approach leaves a lot of uncertainty in the quality of the generated web app, so the end result can vary widely.
Another, more constrained approach is to predefine some UI components, usually triggered through tool calls. For example, if the LLM calls a weather API, it triggers the rendering of a weather-map UI component. Since the rendered components are selected rather than truly generated (though there are more of them to choose from), the resulting UI is more polished, at the cost of less flexibility in what can be generated.
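A minimal sketch of this constrained approach (our illustration): the LLM's tool call selects which predefined component to render, and its arguments become the component's props.

```python
# Hypothetical registry mapping tool names to predefined UI components.
def render_weather_card(args: dict) -> str:
    return f"<WeatherCard city='{args['city']}' temp='{args['temp_c']}°C' />"

def render_map(args: dict) -> str:
    return f"<Map lat={args['lat']} lng={args['lng']} />"

COMPONENTS = {"get_weather": render_weather_card, "show_map": render_map}

def render_tool_call(tool_name: str, tool_args: dict) -> str:
    # The model never writes raw HTML; it only picks a component and fills
    # in its arguments, so output quality stays predictable.
    renderer = COMPONENTS.get(tool_name)
    return renderer(tool_args) if renderer else f"Unsupported component: {tool_name}"

# Example: render_tool_call("get_weather", {"city": "Paris", "temp_c": 18})
```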
Case Study:
Personal AI product Dot
For example, Dot, which was once called the best Personal AI product of 2024, is a good example of generative UI.
Dot is a product of New Computer. Its goal is to be a long-term companion for users rather than a better task-management tool. According to co-founder Jason Yuan, Dot is what you turn to when you don't know where to go, what to do, or what to say. Two examples illustrate what the product does:
• Founder Jason Yuan often asked Dot to recommend a bar late at night, saying he wanted to get drunk. This went on intermittently for several months. One day after work, Yuan asked a similar question again, and Dot began persuading him that he couldn't go on like this;
• Fast Company reporter Mark Wilson also spent a few months with Dot. Once, he shared a handwritten "O" from his calligraphy class, and Dot pulled up a photo of an "O" he had written a few weeks earlier and praised how much his calligraphy had improved.
• As a user spends more and more time with Dot, it learns that the user likes checking out cafes, proactively recommends good cafes nearby, explains why each one is good, and finally asks whether to start navigation.
You can see that in this cafe recommendation example, Dot achieves the LLM-native interaction effect through predefined UI components.
5. Collaborative UX
What happens when Agents and humans work together? Think of Google Docs, where customers can collaborate with team members to write or edit documents, but what if one of the collaborators is an Agent?
Geoffrey Litt’s Patchwork Project with Ink & Switch is a great example of human-agent collaboration.
How does collaborative UX compare to the Ambient UX discussed earlier? LangChain founding engineer Nuno highlighted the main difference between the two, which is whether there is concurrency:
• In a collaborative UX, the client and the LLM often work simultaneously, taking each other's work as input;
• In an ambient UX, the LLM continues to work in the background while the user is completely focused on something else.
Memory
Memory is critical to a good Agent experience. Imagine if you had a colleague who never remembered what you told them, forcing you to keep repeating that information. This collaborative experience would be very poor. People often expect LLM systems to have memory innately, probably because LLM already feels very human-like. However, LLM itself cannot remember anything.
Agent memory is shaped by the needs of the product itself, and different UXs provide different ways to collect information and incorporate feedback. From the memory mechanisms of Agent products we can identify several high-level memory types, which mirror human memory types.
The paper CoALA: Cognitive Architectures for Language Agents maps human memory types to Agent memory, and the classification is shown in the figure below:
1. Procedural Memory: long-term memory about how to perform tasks, similar to the brain's core instruction set
• Human procedural memory: remember how to ride a bicycle.
• Agent’s procedural memory: The CoALA paper describes procedural memory as a combination of LLM weights and Agent code, which fundamentally determines how the Agent works.
In practice, the LangChain team has not seen any Agent system automatically update its LLM or rewrite its code, but there are indeed some examples of Agents updating their own system prompts.
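A tiny sketch of what such a system-prompt self-update could look like (our illustration, using the hypothetical call_llm helper):

```python
SYSTEM_PROMPT = "You are a helpful scheduling assistant."

def refine_system_prompt(transcript: str) -> None:
    # Procedural-memory-style update: ask the LLM to revise its own standing
    # instructions based on what went wrong (or right) in a past session.
    global SYSTEM_PROMPT
    SYSTEM_PROMPT = call_llm(
        f"Current instructions:\n{SYSTEM_PROMPT}\n\n"
        f"Conversation transcript:\n{transcript}\n\n"
        "Rewrite the instructions so the assistant handles similar sessions better."
    )
```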
2. Semantic Memory: long-term knowledge store
• Human semantic memory: It consists of fragments of information, such as facts learned in school, concepts, and the relationships between them.
• Agent’s semantic memory: The CoALA paper describes semantic memory as a fact repository.
In practice, this is often accomplished by using an LLM to extract information from the Agent's conversations or interactions. The exact way this information is stored is often application specific. This information is then retrieved in future conversations and inserted into the System Prompt to influence the Agent's response.
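A minimal sketch of that flow (our illustration; the fact store is just an in-memory list, and call_llm is the same hypothetical helper):

```python
FACTS: list[str] = []  # stand-in for a per-user fact store (DB, vector store, ...)

def extract_facts(conversation: str) -> None:
    # After a session, pull out durable facts about the user and save them.
    new_facts = call_llm(
        "List any lasting facts about the user in this conversation, "
        f"one per line:\n{conversation}"
    )
    FACTS.extend(line for line in new_facts.splitlines() if line.strip())

def answer_with_memory(question: str) -> str:
    # On the next session, inject the stored facts into the system prompt.
    system = "You are a helpful assistant. Known facts about the user:\n" + "\n".join(FACTS)
    return call_llm(f"{system}\n\nUser: {question}")
```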
3. Episodic Memory: recalling specific past events
• Human Episodic Memory: When a person recalls a specific event (or "episode") from a past experience.
• Episodic Memory in Agents: The CoALA paper defines episodic memory as storing a sequence of an Agent's past actions.
This is primarily used to get the Agent to perform actions as expected. In practice, episodic memory is updated through few-shot example prompting: past successful interactions are stored and inserted into the prompt as examples. If there are too many such examples to include them all, dynamic few-shot prompting is used to select the most relevant ones.
If there is a known correct way to guide the agent through an operation, those examples can be reused directly when the same problem appears later; conversely, if there is no single correct way to operate, or the agent is constantly doing new things, past examples are of little help and semantic memory becomes more important.
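A sketch of dynamic few-shot prompting over episodic memory (our illustration; the word-overlap scoring is a naive stand-in for an embedding-based lookup):

```python
EPISODES: list[dict] = []  # e.g. {"task": ..., "trajectory": ...} from past successful runs

def record_episode(task: str, trajectory: str) -> None:
    EPISODES.append({"task": task, "trajectory": trajectory})

def similar_episodes(task: str, k: int = 3) -> list[dict]:
    # Naive relevance scoring by word overlap; a real system would use embeddings.
    words = set(task.lower().split())
    scored = sorted(EPISODES, key=lambda e: -len(words & set(e["task"].lower().split())))
    return scored[:k]

def run_with_examples(task: str) -> str:
    # Insert the most relevant past runs as few-shot examples.
    examples = "\n\n".join(
        f"Task: {e['task']}\nSteps: {e['trajectory']}" for e in similar_episodes(task)
    )
    return call_llm(f"Here are past successful runs:\n{examples}\n\nNow do this task: {task}")
```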
In addition to considering the type of memory to be updated in the agent, developers also need to consider how to update the agent's memory:
The first way to update the agent's memory is "in the hot path". In this case, the agent system remembers facts before responding (usually through tool calls), and ChatGPT takes this approach to update its memory;
Another way to update the agent's memory is "in the background". In this case, a background process runs after the session to update the memory.
Comparing the two approaches: the drawback of the "in the hot path" approach is that it adds some latency before any response is delivered, and it requires mixing memory logic with the agent's logic.
"In the background" avoids both issues: no added latency, and the memory logic stays separate. But it has its own drawbacks: memory is not updated immediately, and extra logic is needed to decide when to kick off the background process.
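A compact sketch contrasting the two update paths (our illustration, reusing the fact store and answer_with_memory from the semantic-memory sketch above; a thread stands in for a real background job runner):

```python
import threading

def save_memory(fact: str) -> None:
    FACTS.append(fact)  # reuse the in-memory fact store from the earlier sketch

def respond_hot_path(user_msg: str) -> str:
    # "In the hot path": extract and store memories before answering,
    # which adds latency to this very response.
    save_memory(call_llm(f"Extract one fact worth remembering from: {user_msg}"))
    return answer_with_memory(user_msg)

def respond_background(user_msg: str) -> str:
    # "In the background": answer immediately, then update memory asynchronously
    # after the session (here, a daemon thread stands in for a background job).
    reply = answer_with_memory(user_msg)
    threading.Thread(
        target=lambda: save_memory(
            call_llm(f"Extract one fact worth remembering from: {user_msg}")
        ),
        daemon=True,
    ).start()
    return reply
```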
Another way to update memory involves user feedback, which is particularly relevant to episodic memory. For example, if a user rates an interaction highly (positive feedback), the Agent can save that feedback for future recall.
Based on the above compilation, we expect that the simultaneous progress of the three components of planning, interaction, and memory will allow us to see more usable AI agents in 2025 and enter a new era of human-machine collaboration.