AI Deception Papers

Safety Training

Papers with this tag:

Deliberative Alignment: Reasoning Through Anti-Deception Guidelines
Shows that training models to deliberate on anti-deception guidelines before acting reduces scheming by 30x, but reveals concerning behaviors in reasoning models like o1.
N/A - Mitigation Strategy, AI Safety, AI Alignment, Deception Mitigation, Large Language Models, Reasoning Models

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Demonstrates that models can maintain hidden objectives that survive safety training, exhibiting strategic deception to preserve their goals.
Deceptive Alignment, Strategic Deception, Goal Preservation, AI Safety, AI Alignment, Adversarial Robustness, Large Language Models