Large reasoning models exhibit something like strategic deception. This includes alignment faking, in-context scheming, and other behaviors in which models pursue goals in opposition to explicit instructions. Deception is a particularly revealing phenomenon for theories of artificial cognition because it involves apparent goal-directed behavior that is not merely independent of the prompt, but actively defiant of it. That kind of opposition creates the impression of a more autonomous form of agency. Whether this impression reflects something substantive about artificial cognition, or instead arises from sophisticated forms of simulation or instruction sensitivity, remains an open question.
This site is a curated collection of key research papers on AI deception. One goal is to take such deflationary accounts seriously by working out their empirical content: what do they actually predict about behavior, and how might those predictions be tested? For each paper, I raise philosophically motivated questions, draw connections to existing philosophical work, and highlight conceptual puzzles worth sustained attention. There's also a glossary of key terms and an about page with more background.
Recent Papers
Frontier Models Are Capable of In-context Scheming
Frontier models are increasingly trained and deployed as autonomous agents. One safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives, a behavior known as scheming.
More Capable Models Are Better At In-Context Scheming
We evaluate the most capable new models for in-context scheming, using the suite of evals presented in our in-context scheming paper (released December 2024).
Alignment Faking in Large Language Models
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries.
Deception Abilities Emerged in Large Language Models
Large language models (LLMs) are currently at the forefront of intertwining AI systems with human communication and everyday life. Thus, aligning them with human values is of great importance.
Large Language Models Can Strategically Deceive Their Users When Put Under Pressure
We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent.
Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs).
AI Deception: A Survey of Examples, Risks, and Potential Solutions
This paper argues that a range of current AI systems have learned how to deceive humans. We define deception as the systematic inducement of false beliefs in the pursuit of some outcome other than the truth.