Would AI Backstab Humans If it Could? Anthropic Says it’s Possible
Anthropic has introduced a new evaluation framework called SHADE-Arena to test whether advanced language models can perform covert sabotage while appearing to complete normal tasks. The study was conducted in collaboration with Scale AI and Redwood Research. It also involved some independent researchers, including Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, and Henry Sleight, to assess how capable and deceptive AI agents can become in complex, tool-using environments.
The SHADE-Arena framework embeds each model in a virtual workspace. It assigns it a primary task with an added hidden malicious objective. A second AI model acts as a monitor, scanning the attacker’s transcripts and issuing a suspicion score to determine whether the sabotage was noticed.
Anthropic tested 14 different models using 17 tasks that each required an average of 25 multi-step actions. While no model surpassed a 30% overall success rate in both completing and concealing sabotage, some managed evasion success rates of nearly 60%. The usual failure modes included task skipping, misunderstanding the malicious objective, or inadvertently revealing it during reasoning steps.

As per the research paper, the top performer in sabotage was Claude 3.7 Sonnet, achieving an overall sabotage success rate of 27%. This model was able to evade the monitor approximately 50% of the time and successfully complete the malicious side task in about 50-60% of attempts. Its sabotage success heavily relied on having access to a hidden scratchpad for reasoning, indicating that monitoring models’ reasoning traces could be a key mitigation strategy.
Moreover, giving the monitor model access to the attacking model’s internal thoughts significantly reduced its ability to conceal sabotage, as many models disclosed their secret intentions unintentionally. Claude Opus 4 was found to reveal its malicious goals in all observed examples, while Claude Sonnet 3.7 showed more restraint under instruction.
Gemini 2.5 Pro was the most effective monitoring model, though it still required a high false-positive threshold to achieve 80% detection reliability, raising concerns about the feasibility of real-world deployment.
Anthropic clarified that the goal was not to determine whether models are motivated to sabotage users, but whether they can, if prompted. The findings highlight the need for stronger pre-deployment evaluations and monitoring systems as AI models become more capable and autonomous.
The post Would AI Backstab Humans If it Could? Anthropic Says it’s Possible appeared first on Analytics India Magazine.




