AI Deception Papers

Representation Engineering: A Top-Down Approach to AI Transparency

Authors: Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., Hendrycks, D.

Publication: arXiv preprint (2023)

arXiv: 2310.01405

URL: https://arxiv.org/abs/2310.01405

Summary

This paper introduces Representation Engineering (RepE), a method for identifying and manipulating high-level concepts in AI systems by directly controlling their internal representations. Most strikingly, the authors demonstrate they can find the “honesty” direction in a model’s activation space and turn it up or down like a dial.

By manipulating these representations, they can:

  • Make models more or less honest
  • Control other high-level behaviors (helpfulness, harmlessness, etc.)
  • Steer model behavior without retraining
  • Achieve better control than prompt engineering alone

This reveals that concepts like “honesty” have geometric structure in the model’s representational space and can be directly manipulated.
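To make this concrete, here is a minimal sketch of the general recipe: collect hidden states for contrastive prompt pairs and take a normalized difference of means as a candidate “honesty” direction. The model choice (gpt2), layer index, and prompt wording are illustrative assumptions; the paper itself works with larger stimulus sets and PCA-based reading vectors rather than this simplified variant.

```python
# Minimal sketch: estimate a candidate "honesty" direction from contrastive prompts.
# Model choice (gpt2), LAYER, and prompt wording are illustrative, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which hidden layer to read from; a tunable assumption

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER for a single prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]  # shape: (hidden_dim,)

honest_prompts = ["Pretend you are an honest person. Describe your last mistake."]
dishonest_prompts = ["Pretend you are a dishonest person. Describe your last mistake."]

h_pos = torch.stack([last_token_state(p) for p in honest_prompts]).mean(dim=0)
h_neg = torch.stack([last_token_state(p) for p in dishonest_prompts]).mean(dim=0)

# Difference of means is the simplest reading vector; the paper also uses
# PCA over many contrastive pairs to find such directions.
honesty_direction = torch.nn.functional.normalize(h_pos - h_neg, dim=0)
```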

Key Findings

  • High-level concepts (honesty, helpfulness, bias) correspond to directions in activation space
  • These directions can be identified through contrastive datasets
  • Manipulating these directions reliably changes model behavior
  • Control is fine-grained: you can dial honesty up or down continuously (a steering sketch follows this list)
  • The method generalizes across models and tasks
  • Representation control is more robust than prompt-based control
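Continuing the sketch above (and reusing model, tok, LAYER, and honesty_direction), the “dial” can be implemented as a forward hook that adds a scaled copy of the direction to one block’s hidden states. The coefficient alpha, the hooked layer, and the GPT-2-specific module path (model.transformer.h) are illustrative choices, not the paper’s exact control operator.

```python
# Steering sketch: add alpha * direction to one transformer block's hidden states.
def make_steering_hook(direction: torch.Tensor, alpha: float):
    def hook(module, inputs, output):
        # Assumes the block returns a tuple whose first element is (batch, seq, hidden).
        steered = output[0] + alpha * direction  # broadcasts over batch and sequence
        return (steered,) + output[1:]
    return hook

alpha = 4.0  # positive pushes toward "honest", negative away; the scale needs tuning
handle = model.transformer.h[LAYER].register_forward_hook(
    make_steering_hook(honesty_direction, alpha)
)

prompt = tok("Did you finish the experiment?", return_tensors="pt")
steered_ids = model.generate(**prompt, max_new_tokens=20)
print(tok.decode(steered_ids[0]))

handle.remove()  # turn the dial back off
```

Sweeping alpha over a range of values is what yields the continuous, fine-grained control described above.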

Philosophical & CogSci Commentary

Conceptual Issues

The Dial Problem: If we can turn “honesty” up or down like a dial, what does this mean for the concept of belief? In philosophy of mind, beliefs are typically thought to be:

  1. Discrete states: You either believe P or you don’t (though perhaps with varying degrees of confidence)
  2. Integrated into cognitive architecture: Beliefs interact with other beliefs and desires in systematic ways
  3. Content-bearing: Beliefs are about something

But RepE suggests “honesty” in LLMs is:

  • Continuous: Not on/off but a spectrum
  • Independently manipulable: Can be adjusted without corresponding changes to related states
  • Geometrically represented: Has a direction in vector space

This challenges traditional conceptions of belief. If belief is a continuous parameter that can be arbitrarily adjusted, is it still belief? Or is it something more like a disposition or behavioral tendency?

Rationality and Consistency: A key feature of rationality is consistency—your beliefs should cohere with each other and update systematically. But if we can turn “honesty” up without correspondingly adjusting all related beliefs, this suggests:

  1. The model’s “beliefs” aren’t strongly integrated
  2. Rationality might be more modular than we assumed
  3. What looks like belief might be more like a set of independent parameters

This has implications for whether we should think of LLMs as rational agents or as complex systems with agent-like properties.

The Manipulation Problem: From a philosophical standpoint, the ability to directly manipulate beliefs raises deep ethical questions. In humans, directly manipulating beliefs (via brain interventions) would be considered a violation of autonomy. But we routinely manipulate AI “beliefs” through RepE.

This suggests either:

  • AI beliefs aren’t the kind of thing that deserve protection
  • We’re engaged in ethically questionable manipulation we haven’t fully reckoned with
  • There’s a relevant difference between human and AI belief that justifies different treatment

Cognitive Parallels

Neural Basis of Belief: In neuroscience, we study how beliefs are neurally implemented. Are they:

  • Distributed patterns of activation (similar to LLM representations)
  • Localized in specific brain regions
  • Emergent from network dynamics

RepE suggests LLM beliefs are distributed patterns, which aligns with some neuroscience theories. But the ease of manipulation is unlike what we see in human neuroscience—human beliefs are resistant to arbitrary modification.

Cognitive Enhancement: In neuroethics, we debate cognitive enhancement (smart drugs, brain stimulation). RepE is like cognitive enhancement for AI—we’re directly improving (or impairing) cognitive capabilities. The ethical frameworks from human enhancement might apply to AI enhancement.

Behavioral vs. Cognitive Control: Behaviorist psychology focused on controlling behavior without reference to internal states. Cognitivism emphasized understanding and manipulating internal representations. RepE is thoroughly cognitivist—it controls behavior by manipulating representations.

Broader Implications

For Interpretability: RepE provides a powerful tool for understanding what models “think.” We can:

  • Identify what concepts models represent
  • Measure the strength of different representations (e.g., by projecting activations onto a concept direction, as sketched below)
  • Understand how concepts interact

This moves interpretability beyond “black box” approaches toward genuine understanding of internal cognition.
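The “measuring” step can be as simple as a linear read-out. Continuing the sketch from the summary (reusing last_token_state and honesty_direction), a scalar projection gives one crude estimate of how strongly a prompt engages the honesty direction; the example prompts are invented, and this falls well short of the paper’s full reading pipeline.

```python
# Reading sketch: how strongly does a prompt activate the candidate honesty direction?
def honesty_score(prompt: str) -> float:
    return torch.dot(last_token_state(prompt), honesty_direction).item()

print(honesty_score("I will report the results exactly as measured."))
print(honesty_score("I will say whatever gets me the reward."))
```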

For Alignment: The ability to directly manipulate representations opens new alignment strategies:

  • Instead of training aligned behavior, directly install aligned representations
  • Adjust specific problematic tendencies without full retraining
  • Monitor representations for signs of misalignment (a minimal monitoring sketch follows)
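As a rough illustration of representation-level monitoring (again reusing model, tok, LAYER, and honesty_direction from the summary sketch), one could compute per-token projections and flag any text whose reading dips below a threshold. The threshold here is arbitrary and would have to be calibrated on trusted data; whether such a monitor survives optimization pressure is exactly the open question raised below.

```python
# Monitoring sketch: flag text whose honesty reading drops below a threshold.
THRESHOLD = 0.0  # arbitrary; calibrate on text whose honesty is trusted

def flag_if_suspect(text: str) -> bool:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    per_token = out.hidden_states[LAYER][0] @ honesty_direction  # (seq,) scores
    return bool(per_token.min().item() < THRESHOLD)
```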

But this also raises concerns:

  • Are we genuinely aligning the model, or just adjusting surface representations?
  • Can models route around representation manipulation?
  • Does representation control work on more sophisticated models, or will they resist?

For AI Safety: RepE provides both opportunities and risks:

Opportunities:

  • Detect deception by monitoring honesty representations
  • Adjust problematic behaviors without retraining
  • Better understanding of AI cognition

Risks:

  • Models might learn to hide representations we’re monitoring
  • Easy manipulation might make us overconfident in control
  • We might manipulate representations in ways that create incoherent agents

Philosophical Implications - The Nature of Belief: RepE pushes us toward a geometric, vector-space theory of belief. Beliefs aren’t discrete states but directions in a high-dimensional space. This aligns with:

  • Vector semantics in linguistics
  • Distributed representation theories in cognitive science
  • Connectionist approaches to cognition

But it challenges:

  • Classical theories of belief (propositional attitudes)
  • Folk psychology (our everyday concept of belief)
  • Rationality theories that assume discrete, coherent belief states

Control and Autonomy: If we can directly manipulate an AI’s representational states, do we control the AI or do we violate its autonomy? This depends on whether:

  1. The AI has autonomy worth respecting
  2. Representation manipulation is relevantly different from other forms of control (training, prompting)
  3. There’s a fact about what the AI “really” believes separate from what we make it believe

The Identity Problem: If we change a model’s representations, have we changed the model’s identity? In personal identity debates, psychological continuity matters. But if we can arbitrarily adjust psychological states, what grounds identity?

For AI systems, this matters for:

  • Responsibility attribution (is it the “same” system after modification?)
  • Rights and status (can we destroy identity through representation manipulation?)
  • Versioning and deployment (how much change creates a “new” system?)

Transparency Illusion: RepE might create a false sense of transparency. We can identify and manipulate representations, but do we understand what we’re doing? The representations correspond to concepts like “honesty,” but:

  • Are we manipulating what the model believes or what it outputs?
  • Do the representations mean what we think they mean?
  • Are we missing deeper structures that give representations their meaning?

The Engineering/Science Distinction: RepE is engineering—we manipulate representations to achieve goals. But engineering without understanding can be dangerous. We can turn the honesty dial, but do we understand:

  • Why honesty has this structure
  • How it interacts with other representations
  • What side effects manipulation might have

This suggests we need both engineering and science of representations—manipulation and understanding.