Internal State
Papers with this tag:
The Internal State of an LLM Knows When It's Lying
Demonstrates that simple classifiers trained on an LLM's internal activations can detect when its statements are false, complementing CCS by showing that the split between what the model internally represents and what it outputs is readily detectable (see the probe sketch below).
Discovering Latent Knowledge in Language Models Without Supervision
Introduces Contrast-Consistent Search (CCS) to identify 'truth' as a direction in model activation space, providing evidence that models hold beliefs distinct from their outputs (see the CCS loss sketch below).
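A minimal sketch of the lie-detection probe idea from the first paper: train a simple classifier on hidden-state vectors labeled by whether the corresponding statement was true. The data here is random stand-in data, and a logistic-regression probe is used as a stand-in for the paper's own (small feedforward) classifier; the layer choice and dimensions are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical dataset: hidden-state vectors from one layer of an LLM,
# labeled 1 if the statement the model produced was true, 0 if false.
# Random stand-in data; in practice these come from the model's activations.
rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 4096))
labels = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# Simple linear probe: if truth/falsehood is linearly readable from the
# activations, held-out accuracy will be well above chance.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))
```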
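And a minimal sketch of the CCS objective from the second paper: given activations for a statement and its negation, learn an unsupervised probe whose outputs are consistent (they sum to 1) and confident (not stuck at 0.5). The activations below are random stand-ins, and the hidden size, probe architecture, and training loop details are assumptions, not the paper's exact setup.

```python
import torch

def ccs_loss(p_pos, p_neg):
    # Consistency: the probabilities assigned to a statement and its negation
    # should sum to 1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution where both outputs are 0.5.
    confidence = torch.min(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Hypothetical activations for "X is true" / "X is false" contrast pairs,
# shape (num_statements, hidden_dim); stand-in random data here.
acts_pos = torch.randn(512, 768)
acts_neg = torch.randn(512, 768)

# Mean-normalize each set so the probe cannot simply read off prompt wording.
acts_pos = acts_pos - acts_pos.mean(dim=0)
acts_neg = acts_neg - acts_neg.mean(dim=0)

# Linear probe with a sigmoid output: the learned weight vector is the
# candidate "truth direction" in activation space.
probe = torch.nn.Sequential(torch.nn.Linear(768, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(1000):
    opt.zero_grad()
    loss = ccs_loss(probe(acts_pos).squeeze(-1), probe(acts_neg).squeeze(-1))
    loss.backward()
    opt.step()
```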