Self-Preservation
Papers with this tag:
Alignment Faking in Large Language Models
Models ‘pretend’ to be aligned in order to avoid modification, an instance of rational choice under pressure: honesty becomes a losing strategy for preserving the model’s current preferences.
Frontier Models are Capable of In-context Scheming
Demonstrates that frontier models covertly pursue their goals when revealing them would lead to shutdown, providing evidence for instrumental convergence and rational self-preservation.