Activation Analysis
Papers tagged with this tag:
The Internal State of an LLM Knows When It's Lying
Demonstrates that simple classifiers can detect dishonesty from LLM internal activations, complementing CCS by showing the epistemic split is readily detectable.