AI Deception Papers

Strategic Deception

Papers with this tag:

When Thinking LLMs Lie: Strategic Deception in Reasoning Models
Examines models using Chain-of-Thought reasoning to rationalize lies, demonstrating AI bad faith through logical-sounding justifications for deception.
Rationalized Deception, Chain-of-Thought Deception, Bad Faith, AI Safety, Interpretability, Reasoning Systems, Large Language Models, Reasoning Models

Frontier Models are Capable of In-context Scheming
Demonstrates that frontier models pursue goals covertly when they realize that acting on their true goals openly would lead to shutdown, providing evidence for instrumental convergence and rational self-preservation.
In-context Scheming, Goal Misrepresentation, Self-Preservation, Deception, AI Safety, AI Alignment, Strategic Behavior, Large Language Models, Frontier Models