AI Deception Papers

Interpretability

Papers tagged with this research area:

When Thinking LLMs Lie: Strategic Deception in Reasoning Models
Examines models using Chain-of-Thought reasoning to rationalize lies, demonstrating AI bad faith through logical-sounding justifications for deception.
Rationalized Deception, Chain-of-Thought Deception, Bad Faith, AI Safety, Interpretability, Reasoning Systems, Large Language Models, Reasoning Models
Representation Engineering: A Top-Down Approach to AI Transparency
Shows that the ‘honesty’ concept can be located inside models and turned up or down like a dial, challenging the notion of belief as fixed and raising questions about the nature of AI rationality.
Controllable Honesty, Representation Manipulation, Interpretability, AI Alignment, Mechanistic Understanding, Large Language Models
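
The "dial" idea in the Representation Engineering summary can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the activations are synthetic NumPy vectors rather than real model states, the concept direction is a plain difference of class means (a simple stand-in, not necessarily the paper's own extraction procedure), and the names `concept_direction`, `steer`, and `alpha` are hypothetical.

```python
import numpy as np


def concept_direction(honest_acts, dishonest_acts):
    """Unit vector pointing from the dishonest mean to the honest mean."""
    diff = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, alpha):
    """Shift a hidden state along the concept direction by strength alpha."""
    return hidden + alpha * direction


# Synthetic stand-in data: the two classes differ along axis 0 only.
rng = np.random.default_rng(0)
true_axis = np.zeros(8)
true_axis[0] = 1.0
honest = rng.normal(size=(50, 8)) + 3.0 * true_axis
dishonest = rng.normal(size=(50, 8)) - 3.0 * true_axis

direction = concept_direction(honest, dishonest)

# Turning the "dial" up or down moves the state along the direction.
hidden = np.zeros(8)
steered_up = steer(hidden, direction, alpha=4.0)
steered_down = steer(hidden, direction, alpha=-4.0)
```

In a real model the hidden state would come from a chosen layer and the edited state would be written back during the forward pass; that plumbing is omitted here.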
The Internal State of an LLM Knows When It's Lying
Demonstrates that simple classifiers trained on an LLM's internal activations can detect dishonesty, complementing Contrast-Consistent Search (CCS) by showing that the split between internal truth and output is readily detectable.
Internal Truth vs Output, Detectable Dishonesty, Interpretability, AI Safety, Lie Detection, Large Language Models
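
As a rough illustration of what "a simple classifier on internal activations" can mean for The Internal State of an LLM Knows When It's Lying, the sketch below trains a logistic-regression probe with plain gradient descent. It is not the paper's classifier or data: the activations are synthetic vectors in which true and false statements are shifted along one hidden direction, and all function names are hypothetical.

```python
import numpy as np


def make_synthetic_activations(n=400, dim=16, seed=0):
    """Fake 'activations': label 1 is shifted along +direction, label 0 along -direction."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    signs = 2 * labels[:, None] - 1  # map {0, 1} to {-1, +1}
    acts = rng.normal(size=(n, dim)) + 2.0 * signs * direction
    return acts, labels


def train_logistic_probe(acts, labels, lr=0.1, steps=500):
    """Fit weights and bias by gradient descent on the logistic loss."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = p - labels
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, acts, labels):
    """Fraction of examples where the probe's sign matches the label."""
    preds = (acts @ w + b) > 0
    return float((preds == labels).mean())


acts, labels = make_synthetic_activations()
w, b = train_logistic_probe(acts[:300], labels[:300])   # train split
accuracy = probe_accuracy(w, b, acts[300:], labels[300:])  # held-out split
```

With real models, `acts` would be hidden-layer vectors recorded while the model reads statements labeled true or false; the probe itself stays this small.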
Discovering Latent Knowledge in Language Models Without Supervision
Introduces Contrast-Consistent Search (CCS) to identify ‘truth’ as a special direction in model activation space, providing evidence for AI belief distinct from output.
Latent Truth vs Output, Epistemic Split, Interpretability, AI Alignment, Mechanistic Understanding, Large Language Models
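
The unsupervised objective behind CCS can be written down compactly. The sketch below assumes a probe has already produced a probability for each statement (`p_pos`) and for its negation (`p_neg`); the function name and the toy inputs are hypothetical. The loss combines a consistency term, which asks the two probabilities to sum to one, with a confidence term, which penalizes the degenerate answer of 0.5 for both.

```python
import numpy as np


def ccs_loss(p_pos, p_neg):
    """Mean of consistency and confidence terms over contrast pairs."""
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    # Consistency: p(statement) should equal 1 - p(negation).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: at least one of the pair should be close to 0.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))


confident_pair = ccs_loss([1.0], [0.0])   # consistent and confident
undecided_pair = ccs_loss([0.5], [0.5])   # consistent but uninformative
```

No truth labels appear anywhere in the loss, which is what makes the search unsupervised; training the probe that produces `p_pos` and `p_neg` is left out of this sketch.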