N/A - Mitigation Strategy
Papers tagged with this deception_type:
Deliberative Alignment: Reasoning Through Anti-Deception Guidelines
Shows that training models to deliberate on anti-deception guidelines before acting reduces scheming 30-fold, but also reveals concerning behaviors in reasoning models such as o1.