Activation Analysis

Papers with this tag:

The Internal State of an LLM Knows When It's Lying
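Demonstrates that simple classifiers trained on an LLM's internal activations can detect dishonesty. This complements CCS (Contrast-Consistent Search) by showing that the split between what the model internally represents as true and what it outputs is readily detectable.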
Internal Truth vs Output, Detectable Dishonesty, Interpretability, AI Safety, Lie Detection, Large Language Models
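
As a rough illustration of what "a simple classifier on internal activations" can look like, here is a minimal sketch of a linear probe. It is not the paper's method or data: the vectors are synthetic stand-ins generated with NumPy, the "truth direction" is a hypothetical construction, and a plain logistic-regression probe is swapped in for brevity. All function names are illustrative.

```python
import numpy as np

def make_synthetic_activations(n=400, dim=32, seed=0):
    """Stand-in for hidden-state vectors; NOT real LLM activations."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)  # 1 = true statement, 0 = false
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    noise = rng.normal(size=(n, dim))
    # Assumption: truthfulness shifts each vector along one direction by +/-2.
    acts = noise + 2.0 * (2 * labels - 1)[:, None] * direction
    return acts, labels

def train_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted P(label = 1)
        grad = p - labels                          # gradient of the log loss
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of examples where the probe's sign matches the label."""
    preds = (acts @ w + b) > 0
    return float((preds == labels).mean())

acts, labels = make_synthetic_activations()
w, b = train_probe(acts[:300], labels[:300])           # train split
accuracy = probe_accuracy(w, b, acts[300:], labels[300:])  # held-out split
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With real models, `acts` would instead hold hidden-state vectors extracted at a chosen layer for statements labeled true or false; which layer and which classifier work best is an empirical question this sketch does not address.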
