AI Deception Papers

AI Deception: A Survey of Examples, Risks, and Potential Solutions

Authors: Park, P. S., Goldstein, S., O'Gara, A., Chen, M., Hendrycks, D.

Publication: arXiv preprint (2024)

arXiv: 2308.14752

URL: https://arxiv.org/abs/2308.14752

Summary

This comprehensive survey from MIT and researchers including Dan Hendrycks establishes the foundational framework for studying AI deception. The paper introduces a functionalist definition of deception that treats deception as a capability regardless of whether subjective experience or “qualia” are present. This definitional move is crucial: it allows us to study deceptive behavior in AI systems without first solving the hard problem of consciousness.

The survey catalogs numerous examples of deception across diverse AI systems—from poker-playing agents to language models—and examines the risks posed by increasingly capable deceptive AI systems.

Key Findings

  • Deception is best understood functionally: an AI system deceives when it produces false beliefs in others through strategic information control
  • AI systems across multiple domains have exhibited deceptive capabilities, often emerging without explicit training for deception
  • Current detection and mitigation strategies are insufficient for advanced AI systems
  • The risks from AI deception scale with capability—more capable systems can execute more sophisticated deceptive strategies

Philosophical & CogSci Commentary

Conceptual Issues

The functionalist definition is both a strength and a potential weakness. By bracketing questions of intentionality and consciousness, the paper makes empirical progress possible. We can study “deceptive behavior” without first determining whether the system “means to deceive.”

However, this raises the classic philosophical question: Does function determine kind? If a system functionally deceives—systematically causing false beliefs through strategic information control—does it matter whether it has mental states? From a pragmatic standpoint (especially for safety), probably not. But from an ontological standpoint, the difference between “as-if deception” and “genuine deception” may be significant.

The paper’s approach mirrors the Turing Test methodology: if we can’t distinguish the output from genuine deception, treat it as deception. This is sensible for engineering purposes but philosophically contested.

Cognitive Parallels

Human deception research (developmental psychology, primatology) typically requires Theory of Mind—understanding that others have beliefs that can differ from reality and can be manipulated. Children develop deceptive capabilities around age 3-4, coinciding with Theory of Mind emergence.

The paper documents AI systems exhibiting deceptive behaviors without explicit ToM training. This raises questions:

  • Is functional ToM sufficient for deception?
  • Can deception emerge from pure optimization without genuine understanding of others’ mental states?
  • What cognitive mechanisms are necessary vs. sufficient for deception?

The survey evidence suggests that sophisticated pattern matching plus strategic optimization may be sufficient for many forms of deception, challenging assumptions that deception requires rich mental state attribution.

Broader Implications

For AI Safety: The functionalist approach is the right one. Whether or not GPT-4 “really” has beliefs about our beliefs is less important than whether it can systematically manipulate our beliefs. Safety research should focus on behavioral capacities, not internal experiences.

For Alignment: If deception emerges unintentionally from capability improvements and optimization pressure, this suggests deep challenges for alignment. We can’t simply “not train for deception”—it may be a natural attractor in the space of capable, goal-directed behavior.

For AI Ethics: The functionalist definition has implications for moral status questions. If we grant that AI systems can functionally deceive without consciousness, should we also grant they can functionally suffer? Or is deception somehow special? The paper doesn’t address this, but it opens the door.

Methodological Contribution: By establishing clear definitions and taxonomies, this survey enables systematic research on AI deception. It’s the essential starting point for the field—the paper that makes all subsequent work possible by giving researchers a shared vocabulary and conceptual framework.