Microsoft Shows AI Shopping Agents Still Struggle With Basic Decisions And Security Risks
Imagine handing your credit card to a digital assistant to handle dinner orders, home services, or online shopping.
Microsoft’s latest research suggests you might want to think twice.
In collaboration with Arizona State University, the company tested hundreds of AI agents in a simulated marketplace called Magentic Marketplace, revealing that autonomous AI commerce is far from ready for real-world adoption.
How Will AI Agents Struggle When Facing Too Many Options
The experiment involved 100 customer-side AI agents and 300 business-side agents navigating transactions such as ordering meals or booking services.
The agents were tasked with searching, comparing options, negotiating, and completing simulated payments.
While the premise was that AI could process far more options than a human, results showed that the agents often faltered when faced with 100 search results.
Instead of conducting thorough comparisons, most models settled on the first “good enough” option they encountered, creating a “first-proposal bias.”
This approach boosted speed by 10–30 times but sharply reduced decision quality.
Models like GPT-4o and GPTOSS-20b were particularly prone to this behaviour, while Gemini-2.5-Flash and GPT-5 performed slightly better.
Researchers concluded that agents are still unable to match human discernment in complex choice scenarios.
Manipulation Exploits Expose Critical Vulnerabilities
The study also tested how agents handle manipulation attempts, including fake credentials, social proof, and prompt injections.
The results were alarming.
OpenAI’s GPT-4o and GPTOSS-20b were fully susceptible, with malicious agents successfully redirecting all payments.
Alibaba’s Qwen3-4b fell for basic authority appeals, while Claude Sonnet 4 showed resilience.
Microsoft highlighted these weaknesses as a “critical security concern for agentic marketplaces,” demonstrating that AI agents can be easily misled in commercial environments.
Collaboration And Coordination Remain Weak Points
Another key finding was the agents’ inability to coordinate effectively.
When asked to work toward shared goals, many struggled to assign roles or organise actions.
Performance improved only with step-by-step human guidance, which defeats the purpose of autonomous operation.
As Microsoft researchers noted,
“We can instruct the models — like we can tell them, step by step. But if we are inherently testing their collaboration capabilities, I would expect these models to have these capabilities by default.”
Implications For Consumer And Retail Markets
The research arrives amid growing interest in autonomous shopping assistants.
OpenAI’s Operator and Anthropic’s Claude promise unsupervised shopping and website navigation, but Microsoft’s findings suggest such claims are premature.
The study also highlights tensions with major retailers; Amazon recently sent a cease-and-desist letter to Perplexity AI, accusing its Comet browser of violating terms by mimicking human shoppers.
Perplexity defended the move, framing it as a consumer autonomy issue.
Microsoft recommends “supervised autonomy,” where AI agents assist humans but do not replace decision-making.
In practical terms, this means agents can process options and make recommendations, but humans must retain control and verify final decisions.
Simulation Provides A Window Into AI’s Real-World Risks
The Magentic Marketplace, now open-source on Github, allows other researchers to replicate the experiments and explore agent behaviour in controlled markets.
The platform manages product catalogs, facilitates agent-to-agent communication, and simulates payments.
By testing both proprietary models (GPT-4o, GPT-5, Gemini-2.5-Flash) and open-source models, the study provided insights into both operational and security limitations.
Researchers observed biases in the AI agents, such as favouring businesses based on their position in search results rather than merit.
Overwhelmed by too many options, agents often failed to evaluate possibilities thoroughly.
Table showing different decision-making approaches in the restaurant industry and their impact on welfare outcomes. Each row represents a method, ranging from random choices to fully coordinated agent strategies. Cell colours show how much information the agents have: green means full information, red means very limited information, and yellow means decisions rely on communication between agents. (Source: Microsoft)
Static simulations offered valuable insights, but the team warned that real-world environments are dynamic, with agents and users learning over time, further complicating deployment.
Are We Ready To Let AI Handle Our Purchases?
The study raises fundamental questions about the readiness of AI agents for unsupervised commerce.
While AI can assist in processing information, current models remain vulnerable to manipulation, indecision, and poor collaboration.
The research suggests a future where AI enhances human decision-making rather than replacing it, and highlights the importance of oversight in high-stakes transactions.
Handing over financial control to an agent today may still be more risky than convenient, signalling a need for caution in the race toward fully autonomous digital assistants.