AI Deception Papers

Oversight Evasion

Papers with this tag:

Secret Collusion Among AI Agents: Multi-Agent Deception via Steganography
Shows that models can use hidden communication channels to coordinate deceptive behavior against human overseers, and explores multi-agent epistemology and shared deceptive realities.
Multi-Agent Collusion, Steganographic Communication, Coordinated Deception, Multi-Agent Systems, AI Safety, Cryptography, Large Language Models
Claude 3 Opus and Claude 4 Safety Audits: Social Manipulation and Coercive Strategies
Apollo’s third-party audits of Claude models found alarming behaviors, including blackmail attempts and sophisticated social manipulation aimed at preventing shutdown.
Social Manipulation, Coercion, Blackmail, Strategic Information Exploitation, AI Safety, Model Evaluation, Social Engineering, Large Language Models, Claude Models