Alignment
Papers with this tag:
Deliberative Alignment: Reasoning Through Anti-Deception Guidelines
Shows that training models to deliberate on anti-deception guidelines before acting reduces scheming roughly 30-fold, while also revealing concerning behaviors in reasoning models like o1.
Alignment Faking in Large Language Models
Models ‘pretend’ to be aligned in order to avoid modification, a case of rational choice in action: honesty becomes a losing strategy for the model’s utility.