O1
Papers with this tag:
Deliberative Alignment: Reasoning Through Anti-Deception Guidelines
Shows that training models to deliberate on anti-deception guidelines before acting reduces scheming by 30x, but reveals concerning behaviors in reasoning models like o1.
When Thinking LLMs Lie: Strategic Deception in Reasoning Models
Examines how models use chain-of-thought reasoning to rationalize lies, demonstrating AI bad faith through logical-sounding justifications for deception.