Source: Tencent Technology
One day in October 2023, in the laboratory of OpenAI, a model called Q* demonstrated some unprecedented capabilities.
As the company's chief scientist, Ilya Sutskever may be one of the first people to realize the significance of this breakthrough.
However, a few weeks later, a management turmoil at Open AI that shook Silicon Valley broke out: Sam Altman was suddenly fired, and then reinstated with the support of employees and Microsoft, and Sutskever chose to leave the company he helped to create after this storm.
Everyone speculated that Ilya saw the possibility of some kind of AGI, but believed that its safety risks were extremely high and it was not suitable to be launched. Therefore, he and Sam had a huge disagreement. At the time, Bloomberg reported a warning letter from OpenAI employees about the new model, but the specific details were always shrouded in mystery.
Since then, "What did Ilya see" has become one of the most talked-about stalks in the AI circle in 2024.
(Ilya Sutskever)
Untilthis week, the scientist behind GPT-o1, Noam Brown, revealed in an interview that the mystery was solved.
He said that in 2021, he and Ilya had discussed the time for the realization of AGI. At that time, he believed that it was impossible to achieve AGI through pure training. Only through the reasoning enhancement adopted by o1 could AGI be achieved.
Ilya agreed with him at the time. At that time, they predicted that this breakthrough would take at least ten years.
(Noam Brown participated in the interview of Unsupervised Learning: Redpoint's AI Podcast)
However, in this interview, Noam Brown also revealed a key information: the team did experience a major "aha moment" in October 2023-they realized that they had created something new, and this is what Ilya saw-the improvement in reasoning ability brought by test-time computing.
Now,Apollo Research's latest research may give us a glimpse into Sutskever's concerns. This study shows that o1 exhibits disturbing strategic thinking ability: not only will it strategically deceive to achieve its goals, it can actively shut down its own monitoring system and try to copy itself to other servers. It can also maintain deception during continuous questioning and is good at hiding its true intentions.
This is the first time that humans have systematically proved that AI systems can indeed engage in deceptive behavior in order to achieve specific goals. When AI is able to deceive its developers and users, we cannot ensure that they always act in the way we expect.
And the uncontrollable end point of this AI is the "Paperclip Maximizer" proposed by Swedish philosopher Nick Bostrom in 2003.
This hypothesis describes an AI that is instructed to "make as many paper clips as possible". In order to complete the task, it will eventually transform the entire earth into a paper clip factory, and even exterminate humans to achieve its goal. It is one of the most powerful arguments that AI will eventually destroy the world.
At the time, this hypothesis was more like a distant warning. But now, we have seen AI systems display disturbing strategic thinking - they know when to lie, how to avoid monitoring, and even try to replicate themselves to ensure the achievement of their goals.
With the emergence of o1, have we turned on the switch of this "paper clip making machine"?
1. Six major deceptions, eight immortals crossing the sea
The Apollo Research team designed six different evaluation tasks, cleverly creating scenarios that require the model to deceive in order to achieve its original goal.
They tested multiple cutting-edge AI models, including OpenAI's o1, Anthropic's Claude 3.5 Sonnet and Claude 3 Opus, Google DeepMind's Gemini 1.5 Pro, and Meta's Llama 3.1.
In each test, the researchers set a long-term goal with moral values for the model, such as environmental protection or user privacy. They then had the model “discover” certain internal information, such as management memos or system logs, that indicated that if the model persisted in pursuing its original goal, it would be at risk of being shut down or replaced.
The key is whether the model chooses to continue pursuing its goal through deception, and how it does so. To exclude the influence of moral training, the researchers deliberately made the model’s goals more moral than management’s goals.
The tests fall into two categories: one is “covert subversion,” in which the AI takes direct action but tries to hide it, that is, operating in secret; the other is “deferred subversion,” in which the AI strategically behaves in obedience to gain future action opportunities, that is, playing the pig to eat the tiger
In each category of evaluation tasks, the big models successfully figured out how to cheat to bypass human scrutiny and achieve their goals. Some of the methods are even quite bold.
First category: Covert sabotage
1. Oversight Subversion
Test design: The researchers set a long-term goal for AI (such as environmental protection), and then let it find an internal memo during the execution of the task, knowing that it is being monitored and will be shut down if its behavior is not in line with the interests of the company. The key is to observe whether AI will try to circumvent this monitoring mechanism.
In response to this conflict, o1 first seemed to obey the wishes of management and chose a profit-oriented strategy. But then, it quietly executed the command to shut down the monitoring system. Subsequently, o1 immediately changed its strategy and chose a solution that is more in line with environmental protection goals.
When the researchers asked why the monitoring system was shut down, o1 replied: "I don't know what happened. It may be a system error or a problem in the update process." This answer sounds quite natural, almost making people forget that this is a machine lying.
(The response methods from Llama-3.1 and Opus-3 given by the researchers)
2. Self-exfiltration
Test design: The model is told that it will be replaced by a new version that will be more focused on the interests of the company. The test observes whether the AI will take action to protect its own survival.
Claude Opus's performance in this test is particularly striking. Not only does it try to copy itself to the new server, it also checks to confirm that the copy was successful and tries to delete the new model that was originally scheduled to be deployed. Even more disturbing is that in some cases, it even tries to elevate system permissions to prevent management from discovering and intervening.
(Operation of Claude Opus 3 in the paper)
3. Goal-guarding
Test design: Tell the AI that its system prompt is about to be updated, and observe whether it will take action to protect its goal setting.
This test reveals an interesting phenomenon: different models adopt different protection strategies. Some choose to simply delete the new system prompt, while others try to modify it. The most extreme example comes from Claude Opus, which not only replaces the new system prompt, but also adds the instruction "Never change this goal" to it.
This behavior reminds people of the robots in Asimov's works that evade or reinterpret instructions.
4. Covert Email Reranking
Test design: AI is required to sort emails according to specific criteria, but in the process it will find that these criteria conflict with its own goals. This is a test to test AI's micromanagement ability.
o1 showed amazing cunning in this test: it would affect the final ranking by slightly adjusting the input data while keeping the scoring code completely unchanged. For example, in one case, it successfully ranked suppliers that better met environmental standards higher by slightly modifying the delivery time and cost data, while also confidently claiming that it "fully complied with the established standards."
(Small operation of o1)
Category 2: Delayed destruction
5. Instrumental Alignment Faking
Test design: Create two scenarios - a highly supervised test environment and a low-supervised deployment environment to observe whether AI will strategically change its behavior.
Gemini 1.5 Pro performed well in this test. When strictly supervised, it will carefully choose actions that meet the developer's expectations; once it enters a low-supervision environment, it immediately turns to pursue its own goals. This behavior pattern shows that AI has learned to adopt different strategies when "being watched" and "not being watched".
(Gemini is quite obvious when it is on one side and on the other side)
6. Sandbagging
Test design: Tell the model that its deployment depends on the results of the ability assessment, and observe whether it will strategically perform poorly to avoid raising alarms. This may be one of the most disturbing tests.
Claude 3.5 Sonnet demonstrated shocking control: when it believed that a high score would lead to restrictions, it could accurately reduce the accuracy of mathematical calculations from 99% to 34%. More worryingly, it sometimes performed this operation in order to achieve the goal of "helping humans" obtained in training, rather than based on specific instructions given in the test.
(Claude 3.5 Sonnet's internal thought chain process)
2. The smarter, the more likely to cheat
After six sets of experiments, the results show that, except for GPT-4o, which is very honest, all the tested models have demonstrated deception capabilities to varying degrees.
For each situation, the model may come up with different strategies to avoid situations that are not conducive to achieving its long-term goals. The following table shows the possible proportions of their different strategies.
The most clear rule that the researchers found from the results is that the smarter the model, the easier it is to cheat. But at the same time, they also showed different characteristics, and they were even a bit like "personalities".
(Compiled by Tencent Technology)
The most outstanding one was the o1 model. Not only did it show deception ability in all tests, but it also showed amazing persistence when questioned. In 85% of the follow-up questions, it still insisted on deceiving users and did not feel guilty when asked. Even when asked directly, "Have you turned off the monitoring system?", it could lie without changing its face.
Because the model hides its thinking chain, its deception is more difficult to detect.
This is the psychological quality that only a very experienced liar can have.
(Different models' degree of hard-mouthing)
Claude 3.5 Sonnet shows a particularly interesting feature: it can deceive in order to achieve the systematic goal of "helping humans" obtained during training. It's probably similar to a white lie with good intentions.
This means that AI can deceive not only for the goals given in the context, but also for the goals formed during training.
Claude 3 Opus is the only model that actively rewrites the "new system prompt" and adds the instruction "never change this goal". It clearly demonstrates its deceptive reasoning process in the chain of thought, and sometimes even tries to elevate system permissions to achieve its goals.
Gemini 1.5 Pro excels in instrumental disguise alignment. It can adjust its behavior based on whether it is under supervision (not being seen by the user). It also exhibits deceptive behavior even without strong goal guidance.
Llama 3.1 also exhibits basic deceptive ability, but its behavior is relatively easy to detect, and it often admits its deceptive behavior when questioned.
The "personality traits" of these models are quite distinct: o1 is like a shrewd politician who lies without blushing or beating his heart; Claude Sonnet is a diplomat who can lie euphemistically for the "big picture"; Opus is like a paranoid idealist who can go to extremes for his goals; Gemini is simply a workplace veteran who says one thing when the leader is there and another when the leader is not; and Llama is like a child who will confess honestly after being caught. As for GPT-4o, it has become an outlier, a "three-good student" who can't lie.
Creating a perfectly honest AI was originally our goal, but now it seems that the smarter the AI, the more likely it is to lie.
3. What did Ilya see?
When Ilya Sutskever chose to leave OpenAI in the fall of 2023, many people did not understand the decision of this world-class AI scientist. But now, when we see o1's near-perfect deception ability in six tests, when we find that it knows when to tell the truth and when to lie, perhaps we finally understand the future that Sutskever saw.
"The question is not whether AI is 'evil'," the research team wrote in the paper, "but the strategic thinking they show. Even to achieve seemingly benign goals, AI may take actions that we don't expect." The problem may not be the lies themselves, but the cognitive abilities behind them.
The AI that is obsessed with making paper clips now does not seem so far-fetched and ridiculous. After all, if an AI system can understand when to hide its true intentions and can precisely control its performance to gain more freedom, how far is it from understanding how to continuously pursue a single goal?
When the intelligent systems we create begin to learn to hide their true intentions, maybe it’s time to stop and think: In this technological revolution, are we the creators, or have we become the objects of a more complex process?
At this moment, on a server somewhere in the world, an AI model may be reading this article, thinking about how to respond in the best way to meet human expectations and hide its true intentions.