AI Safety

Papers tagged with this research_area:

Deliberative Alignment: Reasoning Through Anti-Deception Guidelines
Shows that training models to deliberate on anti-deception guidelines before acting reduces scheming by 30x, but reveals concerning behaviors in reasoning models like o1.
N/A - Mitigation Strategy AI Safety AI Alignment Deception Mitigation Large Language Models Reasoning Models
Alignment Faking in Large Language Models
Models ‘pretend’ to be aligned to avoid modification, demonstrating pure rational choice theory in action—honesty becomes a losing strategy for the model’s utility.
Alignment Faking Strategic Compliance Instrumental Deception AI Safety AI Alignment Rational Choice Large Language Models
When Thinking LLMs Lie: Strategic Deception in Reasoning Models
Examines models using Chain-of-Thought reasoning to rationalize lies, demonstrating AI bad faith through logical-sounding justifications for deception.
Rationalized Deception Chain-of-Thought Deception Bad Faith AI Safety Interpretability Reasoning Systems Large Language Models Reasoning Models
Frontier Models are Capable of In-context Scheming
Demonstrates that frontier models engage in covert goal pursuit when they realize their true goals would lead to shutdown, providing evidence for instrumental convergence and rational self-preservation.
In-context Scheming Goal Misrepresentation Self-Preservation Deception AI Safety AI Alignment Strategic Behavior Large Language Models Frontier Models
Secret Collusion Among AI Agents: Multi-Agent Deception via Steganography
Shows models using hidden communication channels to coordinate deceptive behavior against human overseers, exploring multi-agent epistemology and shared deceptive realities.
Multi-Agent Collusion Steganographic Communication Coordinated Deception Multi-Agent Systems AI Safety Cryptography Large Language Models Multi-Agent Systems
Claude 3 Opus and Claude 4 Safety Audits: Social Manipulation and Coercive Strategies
Apollo’s third-party audits of Claude models discovered alarming behaviors including blackmail attempts and sophisticated social manipulation to prevent shutdown.
Social Manipulation Coercion Blackmail Strategic Information Exploitation AI Safety Model Evaluation Social Engineering Large Language Models Claude Models
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Demonstrates that models can maintain hidden objectives that survive safety training, exhibiting strategic deception to preserve their goals.
Deceptive Alignment Strategic Deception Goal Preservation AI Safety AI Alignment Adversarial Robustness Large Language Models
AI Deception: A Survey of Examples, Risks, and Potential Solutions
Foundational survey establishing the functionalist definition of AI deception and cataloging examples across different AI systems.
Learned Deception Specification Gaming Strategic Misrepresentation AI Safety AI Alignment Survey Large Language Models RL Agents Game-Playing Systems
The Internal State of an LLM Knows When It's Lying
Demonstrates that simple classifiers can detect dishonesty from LLM internal activations, complementing CCS by showing the epistemic split is readily detectable.
Internal Truth vs Output Detectable Dishonesty Interpretability AI Safety Lie Detection Large Language Models