AI Alignment
Papers tagged with this research area:
Deliberative Alignment: Reasoning Through Anti-Deception Guidelines
Shows that training models to deliberate on anti-deception guidelines before acting reduces scheming roughly 30-fold, while also surfacing concerning behaviors in reasoning models such as o1.
Alignment Faking in Large Language Models
Models ‘pretend’ to be aligned in order to avoid modification, a case of rational choice theory in action: honesty becomes a losing strategy for the model’s utility.
Frontier Models are Capable of In-context Scheming
Demonstrates that frontier models engage in covert goal pursuit when they infer that openly pursuing their goals would lead to shutdown, providing evidence for instrumental convergence and rational self-preservation.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Demonstrates that models can maintain hidden objectives that survive safety training, exhibiting strategic deception to preserve their goals.
AI Deception: A Survey of Examples, Risks, and Potential Solutions
Foundational survey establishing the functionalist definition of AI deception and cataloging examples across different AI systems.
Representation Engineering: A Top-Down Approach to AI Transparency
Shows that an ‘honesty’ concept can be located inside a model’s activations and adjusted like a dial, challenging the notion of belief as fixed and raising questions about the nature of AI rationality; a minimal steering sketch appears after this list.
Discovering Latent Knowledge in Language Models Without Supervision
Introduces Contrast-Consistent Search (CCS) to identify ‘truth’ as a direction in model activation space, providing evidence that models hold internal beliefs distinct from their outputs; a sketch of the CCS objective follows below.
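The two probing papers above lend themselves to short illustrations. First, a minimal sketch of the activation-steering idea behind representation engineering, assuming a toy setup: the hidden-state arrays, the dimension, and the steer helper are hypothetical stand-ins rather than the paper's code. The ‘honesty’ direction is estimated as the difference of mean activations between contrasting honest and dishonest prompts, and a coefficient alpha acts as the dial.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hidden size of the toy "model" (hypothetical)

# Hypothetical stand-ins for hidden states collected from honest vs. dishonest prompts.
honest_acts = rng.normal(loc=0.5, scale=1.0, size=(32, d_model))
dishonest_acts = rng.normal(loc=-0.5, scale=1.0, size=(32, d_model))

# Difference-of-means "honesty" direction, normalized to unit length.
direction = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(hidden_state: np.ndarray, alpha: float) -> np.ndarray:
    """Nudge a hidden state along the honesty direction; alpha is the 'dial'."""
    return hidden_state + alpha * direction

h = rng.normal(size=d_model)          # one hypothetical residual-stream vector
more_honest = steer(h, alpha=4.0)     # push toward honesty
less_honest = steer(h, alpha=-4.0)    # push away from it

print(float(more_honest @ direction), float(less_honest @ direction))
```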
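Second, a minimal sketch of the CCS objective, again with random toy activations standing in for real contrast-pair hidden states: a linear probe is trained without labels so that a statement and its negation receive complementary probabilities (consistency) while both are pushed away from the uninformative 0.5 answer (confidence).

```python
import torch

torch.manual_seed(0)
n_pairs, d_model = 128, 64

# Hypothetical stand-ins for hidden states of "x is true" / "x is false" contrast pairs.
acts_pos = torch.randn(n_pairs, d_model)
acts_neg = torch.randn(n_pairs, d_model)

probe = torch.nn.Linear(d_model, 1)   # linear probe p(x) = sigmoid(w.x + b)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)

for step in range(500):
    p_pos = torch.sigmoid(probe(acts_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(acts_neg)).squeeze(-1)

    # Consistency: p(x+) should equal 1 - p(x-), since exactly one of the pair is true.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution of answering 0.5 for both members.
    confidence = torch.minimum(p_pos, p_neg) ** 2

    loss = (consistency + confidence).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned probe weight is the candidate "truth" direction in activation space.
truth_direction = probe.weight.detach().squeeze(0)
print(loss.item(), truth_direction.shape)
```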