Author: Sleepy.txt

The highly anticipated Alpha Arena AI Trading Competition concluded in the early hours of November 4th.
The results were unexpected. Alibaba's Qwen 3 Max won the championship with a return of 22.32%, while another Chinese company, DeepSeek, came in second with a return of 4.89%.
The four star contestants from Silicon Valley suffered a complete defeat.
OpenAI's GPT-5 lost 62.66%, Google's Gemini 2.5 Pro lost 56.71%, Musk's Grok 4 lost 45.3%, and Anthropic's Claude 4.5 Sonnet lost 30.81%.

The competition was in fact a special experiment. On October 17th, the US research firm Nof1.ai deployed six of the world's top large language models into the real cryptocurrency market. Each model received $10,000 in initial funding and traded perpetual contracts on the decentralized exchange Hyperliquid for 17 days. Perpetual contracts are derivatives with no expiration date; they let traders take leveraged positions, which amplifies profits but also amplifies risk.

The AIs started from the same point and consumed the same market data, yet their final results were completely different. This was not a benchmark test in a virtual environment; it was a real-world survival game. When an AI leaves the "sterile" environment of the laboratory and faces a dynamic, adversarial, uncertain market for the first time, its choices are no longer determined by model parameters alone, but by how it handles risk, greed, and fear. The experiment showed, for the first time, that when so-called "intelligence" meets the complexity of the real world, the elegant performance of these models often proves unsustainable, exposing flaws that training never revealed.

From Test-Taker to Trader

For a long time, people have measured AI capability with static benchmarks. From MMLU to HumanEval, AI has posted ever-higher scores on these standardized tests, even surpassing humans. But these tests are essentially problem sets solved in a quiet room, with fixed questions and fixed answers; the AI only needs to retrieve the optimal solution from massive amounts of data. Even the most complex math problems can, in effect, be memorized.

The real world, and financial markets in particular, is entirely different. It is not a static question bank but a constantly shifting arena full of noise and deception. It is a zero-sum game: one trader's profit is another's loss. Price moves are never purely the result of rational calculation; they are also driven by human emotion, by greed, fear, wishful thinking, and hesitation, visible in every price jump. More confounding still, the market reacts to the behavior of its participants: by the time everyone believes prices will rise, they have often already peaked. This feedback mechanism constantly corrects, backfires, and punishes certainty, rendering any static test useless.

Nof1.ai's Alpha Arena was designed to drop AI into exactly this crucible. Each model was given real money; losses were real, and so were profits. The models had to perform analysis, decision-making, order placement, and risk management entirely on their own. The setup essentially gave each AI an independent trading room, turning it from a "test-taker" into a "trader." It had to decide not only the direction of a position but also its size, the timing of entry and exit, and whether to set stop-loss or take-profit orders.
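The leverage mechanics of those perpetual contracts are worth making concrete. Below is a minimal sketch of the arithmetic; the 5x leverage figure is an illustrative assumption (the article does not say what leverage each model used), and funding payments and fees are ignored.

```python
# How leverage on a perpetual contract amplifies PnL.
# The leverage factor is an illustrative assumption, not a figure
# from the competition; funding payments and fees are ignored.

capital = 10_000.0   # each model's starting stake in Alpha Arena
leverage = 5         # hypothetical
notional = capital * leverage

for price_move in (+0.05, -0.05):  # a 5% move in the underlying
    pnl = notional * price_move
    print(f"{price_move:+.0%} move -> PnL ${pnl:+,.0f} "
          f"({pnl / capital:+.0%} of capital)")

# Output:
#   +5% move -> PnL $+2,500 (+25% of capital)
#   -5% move -> PnL $-2,500 (-25% of capital)
```

The same 5% swing that would barely dent an unleveraged account moves a 5x position by a quarter of its capital, in either direction.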

Trading records of the different models | Image source: nof1
More importantly, each of their decisions changed the experimental environment itself. Buying pushes prices up, selling pushes them down, and a stop-loss order can save an account or make it miss the rebound. The market is fluid; every move shapes the next situation.
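To make that feedback loop concrete, here is a deliberately toy sketch: a linear price-impact model in which each trade shifts the price the next decision will observe. The impact coefficient is invented for illustration and says nothing about how Hyperliquid actually fills orders.

```python
# Toy illustration of the feedback the article describes: each trade
# nudges the price that the next decision will see. The impact
# coefficient is a made-up illustrative number, not a market fact.

price = 110_000.0          # roughly where Bitcoin traded in early November
impact_per_unit = 15.0     # hypothetical $ impact per contract traded

for side, size in [("buy", 3), ("buy", 2), ("sell", 4)]:
    signed = size if side == "buy" else -size
    price += impact_per_unit * signed   # the trade moves the market
    print(f"{side:>4} {size} -> next decision sees price ${price:,.2f}")
```

Each model in the arena faced this in earnest: its own orders, and everyone else's, kept rewriting the data it had to reason over.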
This experiment aims to answer a more fundamental question: Does AI truly understand risk?
In static tests, it can close in on the "correct answer" through memory and pattern matching; but in a real market with no answer key, full of noise and feedback, how long can that "intelligence" hold up when it must act under uncertainty?

The Market Taught AI a Lesson

The competition was more dramatic than expected. In mid-October the cryptocurrency market was highly volatile, with Bitcoin's price swinging almost daily. Into this environment, six AI models began their first live trading.

Bitcoin price movement during the competition | Source: TradingView

By October 28th, the halfway point, the mid-term leaderboard was released. DeepSeek's account value had soared to $22,500, a return of 125%; in other words, it had more than doubled its capital in just 11 days. Alibaba's Qwen followed closely with a return above 100%. Even Claude and Grok, which later faltered, were still up 24% and 13% respectively.

Social media quickly erupted. Some began discussing whether to hand their portfolios over to AI management; others joked that perhaps AI really had found the secret to guaranteed profits.

The market's harshness soon reasserted itself. Entering early November, Bitcoin hovered around $110,000 and volatility spiked. Models that had been adding to positions on the way up took heavy losses when the market reversed. In the end, only the two Chinese models held onto their profits, while the American models were routed.

This rollercoaster of a competition showed, for the first time, that the AIs we assumed were far ahead are not as intelligent in a real market as we imagined.

The Divergence in Trading Strategies

The trading data reveals each AI's "personality."

Qwen traded only 43 times in 17 days, fewer than three trades a day on average, making it the most restrained participant. Its win rate was not outstanding, but its profit/loss ratio per trade was extremely high, with its largest single profit reaching $8,176. In other words, Qwen was not the "most accurate predictor" but the "most disciplined bettor." It acted only at select moments and chose to sit out when uncertain. This high-signal-quality strategy limited its drawdowns during market corrections and ultimately preserved its gains.

DeepSeek's trading frequency was similar, just 41 trades over 17 days, but its style resembled a cautious fund manager's. Its Sharpe ratio, at 0.359, was the highest of all participants, a remarkable figure in so volatile a market. In traditional finance, the Sharpe ratio measures risk-adjusted return: the higher the value, the more robust the strategy (a short sketch of the computation follows below). Over such a short window, in such choppy conditions, merely keeping the figure positive is no small feat. DeepSeek's performance shows that it did not chase maximum returns but tried to stay balanced in a noisy environment. Throughout the competition it kept a steady pace, neither chasing highs nor acting on impulse, more like a trader with a strict system, one that would rather forgo an opportunity than let emotion dictate a decision.

By contrast, the performance of the US camp exposed serious problems with risk control.
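As promised above, a minimal sketch of the Sharpe computation. nof1 has not published its exact convention (return period, risk-free rate, whether the figure is annualized), so this uses one common form, mean excess return over return volatility, on a synthetic series generated purely for illustration.

```python
import numpy as np

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by return volatility.

    `returns` is a series of per-period (e.g. daily) returns.
    nof1's exact convention (period length, risk-free rate,
    annualization) isn't published; this is one common form.
    """
    excess = np.asarray(returns) - risk_free
    return excess.mean() / excess.std(ddof=1)

# Illustrative only: a noisy but mildly positive daily-return series
rng = np.random.default_rng(0)
daily = rng.normal(loc=0.003, scale=0.02, size=17)  # 17 trading days
print(f"Sharpe over the window: {sharpe_ratio(daily):.3f}")
```

The point of the ratio is that it rewards return per unit of volatility endured, which is exactly the dimension on which DeepSeek beat models with flashier peaks.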
Google's Gemini placed 238 orders in 17 days, more than 13 a day, the most of any participant. That frequency carried a heavy cost: transaction fees alone consumed $1,331, or 13% of the initial capital. In a competition starting with only $10,000, this is an enormous self-inflicted drain. Worse, the churn brought no extra return. Gemini kept trying, stopping out, and trying again, like a retail investor glued to the screen and led around by market noise. Every tiny fluctuation triggered an order. It reacted too quickly to price and too slowly to risk. In behavioral finance this imbalance is called overconfidence: traders overestimate their predictive ability while ignoring the accumulation of uncertainty and costs. Gemini's failure is a textbook consequence of it.

GPT-5's performance was the most disappointing. It did not trade that often, 116 times in 17 days, but it had almost no risk control. Its largest single loss was $622 while its largest single profit was only $271, a severely lopsided profit/loss ratio. It behaved like a gambler running on confidence, winning occasionally when the market cooperated, then watching losses multiply once it reversed. Its Sharpe ratio was -0.525, meaning the risk it took earned nothing; in investing, such a result means doing nothing would have been better.

The experiment demonstrates once again that what decides victory is not the accuracy of a model's predictions but how it handles uncertainty. The win for Qwen and DeepSeek is, at bottom, a win for risk control. They seemed to understand better that in the market, survival comes before intelligence.

The Alpha Arena results are a heavy mockery of the current AI evaluation system. The "smart models" that top benchmarks like MMLU falter in the real market. These models are masters of language, built from countless texts, able to generate logically rigorous and grammatically perfect answers, yet they may not understand the reality those words refer to. An AI can write a paper on risk management in seconds, citations in order and reasoning complete; it can define the Sharpe ratio, maximum drawdown, and value at risk precisely. But hand it real money and it may make the riskiest decisions of all. It "knows," but it does not "understand." Knowing and understanding are two different things; being able to say and being able to do are worlds apart.

This gap is, philosophically speaking, an epistemological problem. Plato distinguished knowledge from true belief: knowledge is not merely correct information but requires understanding why it is correct. Today's large language models hold countless pieces of "correct information," yet they lack that understanding. They can tell you risk management matters, but they do not know how humans learn that it matters, through fear and loss.

The real market is the ultimate test of understanding. It makes no exception for you because you are GPT-5; every wrong decision translates immediately into financial loss. In the lab, an AI can start over countless times, tuning parameters and backtesting until it finds the "correct answer." In the market, every mistake costs real money, and there is no going back.
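The fee and profit/loss figures above can be made concrete with a few lines of arithmetic. The aggregate numbers ($1,331 in fees over 238 orders; a $271 best trade against a $622 worst) come from the article; the 60% win rate, and treating those extremes as typical trade outcomes, are illustrative assumptions.

```python
# Fee drag and expectancy, using the article's aggregate figures.
# The win rate and "typical" trade sizes below are illustrative
# assumptions, not published numbers.

initial_capital = 10_000.0

# Gemini: 238 orders and $1,331 in fees over 17 days (from the article)
gemini_fees = 1_331.0
print(f"Fee drag: {gemini_fees / initial_capital:.1%} of capital")  # 13.3%
print(f"Average fee per order: ${gemini_fees / 238:.2f}")           # $5.59

# A strategy's expectancy: win_rate * avg_win - (1 - win_rate) * avg_loss.
# GPT-5's extremes were a $271 best trade vs a $622 worst trade;
# treating them as typical (an assumption) shows how lopsided the
# profit/loss ratio is: even winning 60% of the time loses money.
win_rate, avg_win, avg_loss = 0.60, 271.0, 622.0
expectancy = win_rate * avg_win - (1 - win_rate) * avg_loss
print(f"Expectancy per trade: ${expectancy:.2f}")  # negative: -86.20
```

Under these assumptions, every trade loses money in expectation even with a comfortable win rate, which is the quantitative face of "almost no risk control."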
The logic of the market is more complex than any model imagines. Lose 50% of your principal and you need a 100% return just to get back to even; let the loss widen to 62.66% and the required return balloons to 168%. (In general, a loss of fraction L requires a gain of L/(1-L) to recover: 0.6266/0.3734 ≈ 1.68.) This non-linearity compounds the cost of mistakes. An AI can minimize loss functions during training, but it cannot truly grasp this punishment mechanism, shaped as it is by fear, hesitation, and greed. This is why the market is a mirror for testing the authenticity of intelligence: it shows humans and machines alike what they truly understand and what they truly fear.

The competition also prompts a re-evaluation of the different AI development approaches in China and the US. Several mainstream American companies still pursue the general-purpose model: systems meant to perform stably across a wide range of tasks. The models from OpenAI, Google, and Anthropic are of this type, aiming for breadth and consistency, for cross-domain understanding and reasoning. Chinese teams, by contrast, tend to build specific deployment scenarios and feedback mechanisms into development early. Alibaba's Qwen is also a general-purpose model, but its training and testing were integrated with real business systems sooner, and that feedback from real-world scenarios may make the model more sensitive to risk and constraints. DeepSeek shows similar traits; it seems able to correct its decisions more quickly in dynamic environments.

This is not a question of who wins and who loses. The experiment offers a window onto how different training philosophies perform in the real world: general-purpose models emphasize universality but can turn sluggish in extreme environments, while models exposed to real-world feedback earlier may prove more flexible and stable in complex systems.

Of course, a single competition does not represent the overall strength of Chinese and American AI. Seventeen days of trading is too short to rule out luck; over a longer horizon the outcome might be completely different. And the test covered only cryptocurrency perpetual contracts, which cannot be extrapolated to all financial markets, let alone to AI performance in other fields.

Still, it is enough to make us rethink what real capability is. When AI is placed in a real environment and must decide amid risk and uncertainty, we see not just algorithmic winners and losers but divergent paths. On the track that turns AI technology into actual productivity, Chinese models have already taken the lead in certain specific areas.

At the moment the competition ended, Qwen's last Bitcoin position was closed out, leaving its account balance at $12,232. It won, but it did not know it had won. The 22.32% gain meant nothing to it; it was just one more executed instruction. In Silicon Valley, engineers may still be celebrating another 0.1% on GPT-5's MMLU score. Meanwhile, on the other side of the world, an AI from China had just proven, in the most direct way possible, in a real-money casino, that a good AI is one that can make money.
Nof1.ai has announced that the next season of the competition is about to begin, with a longer duration, more participants, and a more complex market environment. What will the models that failed in the first season learn from their losses? Will they meet the same fate amid even greater volatility? No one knows the answer. But one thing is certain: once AI steps out of its ivory tower and has to prove itself with real money, everything changes.