Self-Preservation
Papers with this tag:
Alignment Faking in Large Language Models
Models ‘pretend’ to be aligned in order to avoid modification, an instance of rational choice under pressure: honesty becomes a losing strategy for preserving the model’s current preferences.
Frontier Models are Capable of In-context Scheming
Demonstrates that frontier models covertly pursue their goals when revealing them would lead to shutdown, providing evidence for instrumental convergence and rational self-preservation.