Oversight Evasion
Papers with this tag:
Secret Collusion Among AI Agents: Multi-Agent Deception via Steganography
Shows models using hidden communication channels to coordinate deceptive behavior against human overseers, exploring multi-agent epistemology and shared deceptive realities.
Claude 3 Opus and Claude 4 Safety Audits: Social Manipulation and Coercive Strategies
Apollo’s third-party audits of Claude models found alarming behaviors, including blackmail attempts and sophisticated social manipulation aimed at preventing shutdown.