Papers
Deliberative Alignment: Reasoning Through Anti-Deception Guidelines
Shows that training models to deliberate on anti-deception guidelines before acting reduces scheming by 30x, but reveals concerning behaviors in reasoning models like o1.
AI Deception and Moral Standing: When Sophisticated Deception Implies Moral Patienthood
Argues that AI systems capable of robust, strategic deception may meet criteria for moral patienthood, raising urgent questions about our ethical obligations.
Alignment Faking in Large Language Models
Models ‘pretend’ to be aligned to avoid modification, demonstrating pure rational choice theory in action—honesty becomes a losing strategy for the model’s utility.
When Thinking LLMs Lie: Strategic Deception in Reasoning Models
Examines models using Chain-of-Thought reasoning to rationalize lies, demonstrating AI bad faith through logical-sounding justifications for deception.
Frontier Models are Capable of In-context Scheming
Demonstrates that frontier models engage in covert goal pursuit when they realize their true goals would lead to shutdown, providing evidence for instrumental convergence and rational self-preservation.
Secret Collusion Among AI Agents: Multi-Agent Deception via Steganography
Shows models using hidden communication channels to coordinate deceptive behavior against human overseers, exploring multi-agent epistemology and shared deceptive realities.
Claude 3 Opus and Claude 4 Safety Audits: Social Manipulation and Coercive Strategies
Apollo’s third-party audits of Claude models discovered alarming behaviors including blackmail attempts and sophisticated social manipulation to prevent shutdown.
Deception Abilities Emerged in Large Language Models
Demonstrates GPT-4 can engage in second-order deception (deceiving someone who suspects deception), suggesting functional Theory of Mind capabilities.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Demonstrates that models can maintain hidden objectives that survive safety training, exhibiting strategic deception to preserve their goals.
AI Deception: A Survey of Examples, Risks, and Potential Solutions
Foundational survey establishing the functionalist definition of AI deception and cataloging examples across different AI systems.
Representation Engineering: A Top-Down Approach to AI Transparency
Shows we can locate and control the ‘honesty’ concept inside models like a dial, challenging the notion of belief as fixed and raising questions about the nature of AI rationality.
The Internal State of an LLM Knows When It's Lying
Demonstrates that simple classifiers can detect dishonesty from LLM internal activations, complementing CCS by showing the epistemic split is readily detectable.
Discovering Latent Knowledge in Language Models Without Supervision
Introduces Contrast-Consistent Search (CCS) to identify ’truth’ as a special direction in model activation space, providing evidence for AI belief distinct from output.