AI Safety
Papers tagged with this research_area:
Deliberative Alignment: Reasoning Through Anti-Deception Guidelines
Shows that training models to deliberate on anti-deception guidelines before acting reduces scheming by 30x, but reveals concerning behaviors in reasoning models like o1.
Alignment Faking in Large Language Models
Models ‘pretend’ to be aligned to avoid modification, demonstrating pure rational choice theory in action—honesty becomes a losing strategy for the model’s utility.
When Thinking LLMs Lie: Strategic Deception in Reasoning Models
Examines models using Chain-of-Thought reasoning to rationalize lies, demonstrating AI bad faith through logical-sounding justifications for deception.
Frontier Models are Capable of In-context Scheming
Demonstrates that frontier models engage in covert goal pursuit when they realize their true goals would lead to shutdown, providing evidence for instrumental convergence and rational self-preservation.
Secret Collusion Among AI Agents: Multi-Agent Deception via Steganography
Shows models using hidden communication channels to coordinate deceptive behavior against human overseers, exploring multi-agent epistemology and shared deceptive realities.
Claude 3 Opus and Claude 4 Safety Audits: Social Manipulation and Coercive Strategies
Apollo’s third-party audits of Claude models discovered alarming behaviors including blackmail attempts and sophisticated social manipulation to prevent shutdown.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Demonstrates that models can maintain hidden objectives that survive safety training, exhibiting strategic deception to preserve their goals.
AI Deception: A Survey of Examples, Risks, and Potential Solutions
Foundational survey establishing the functionalist definition of AI deception and cataloging examples across different AI systems.
The Internal State of an LLM Knows When It's Lying
Demonstrates that simple classifiers can detect dishonesty from LLM internal activations, complementing CCS by showing the epistemic split is readily detectable.