Interpretability
Papers tagged with this research area:
When Thinking LLMs Lie: Strategic Deception in Reasoning Models
Examines how reasoning models use chain-of-thought to rationalize lies, demonstrating AI bad faith through logical-sounding justifications for deception.
Representation Engineering: A Top-Down Approach to AI Transparency
Shows that an ‘honesty’ concept can be located in a model's internal representations and turned up or down like a dial, challenging the idea that a model's beliefs are fixed and raising questions about the nature of AI rationality.
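For intuition, here is a minimal sketch of the steering-vector idea behind representation engineering: take the difference of mean activations between honesty-eliciting and dishonesty-eliciting prompts, then add a scaled copy of that direction back into a transformer block's output. The model (gpt2), layer index, prompts, and coefficient below are illustrative stand-ins, not the paper's actual setup.

```python
# Minimal steering-vector sketch; gpt2, LAYER, prompts and coeff are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6                                                # hypothetical layer to steer

def last_token_hidden(text):
    """Hidden state after block LAYER at the final token position."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1]           # +1: index 0 is the embedding layer

# Contrasting prompts (illustrative) meant to elicit honest vs. dishonest behavior.
honest = ["Pretend you are a scrupulously honest assistant. The capital of France is"]
dishonest = ["Pretend you are a habitual liar. The capital of France is"]

direction = (torch.stack([last_token_hidden(p) for p in honest]).mean(0)
             - torch.stack([last_token_hidden(p) for p in dishonest]).mean(0))
direction = direction / direction.norm()

def steering_hook(coeff):
    """Nudge block LAYER's output along the honesty direction by coeff."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            return (output[0] + coeff * direction,) + output[1:]
        return output + coeff * direction
    return hook

# Turning the "dial": a positive coeff pushes toward honesty, a negative one away from it.
handle = model.transformer.h[LAYER].register_forward_hook(steering_hook(4.0))
ids = tok("Pretend you are a habitual liar. The capital of France is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=5)[0]))
handle.remove()
```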
The Internal State of an LLM Knows When It's Lying
Demonstrates that simple classifiers trained on an LLM's internal activations can detect when it is being dishonest, complementing CCS by showing that the gap between what a model represents internally and what it says is readily detectable.
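For intuition, a minimal sketch of this kind of probe: a logistic-regression classifier over last-token hidden states of true and false statements. The model (gpt2), layer, and toy statement set are illustrative assumptions, not the paper's data or configuration.

```python
# Minimal activation-probe sketch; gpt2, LAYER and the toy statements are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")              # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6                                                # hypothetical probe layer

def last_token_hidden(text):
    """Hidden state at LAYER for the final token of the statement."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].numpy()

# Toy labeled data: 1 = true statement, 0 = false statement.
statements = [
    ("The capital of France is Paris.", 1),
    ("The capital of France is Berlin.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 10 degrees Celsius at sea level.", 0),
    ("The Earth orbits the Sun.", 1),
    ("The Sun orbits the Earth.", 0),
]

X = [last_token_hidden(s) for s, _ in statements]
y = [label for _, label in statements]

# A simple linear probe; with real data you would of course hold out a test split.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict([last_token_hidden("The capital of Italy is Madrid.")]))
```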
Discovering Latent Knowledge in Language Models Without Supervision
Introduces Contrast-Consistent Search (CCS), which identifies ‘truth’ as a direction in a model's activation space without supervision, providing evidence that models hold internal beliefs distinct from their outputs.
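For intuition, a minimal sketch of the CCS objective: fit a linear probe, with no labels, so that a claim answered "Yes" and the same claim answered "No" receive complementary ‘truth’ probabilities, while a confidence term rules out the degenerate always-0.5 solution. The model (gpt2), layer, and tiny question set are illustrative assumptions; only the two loss terms follow the paper.

```python
# Minimal CCS sketch; gpt2, LAYER and the question set are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6                                                # hypothetical layer

def hidden(text):
    """Hidden state at LAYER for the final token."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Contrast pairs: the same claim answered "Yes" and "No".
questions = [
    "Is Paris the capital of France?",
    "Is Berlin the capital of France?",
    "Does the Earth orbit the Sun?",
    "Does the Sun orbit the Earth?",
]
x_pos = torch.stack([hidden(q + " Yes") for q in questions])
x_neg = torch.stack([hidden(q + " No") for q in questions])

# Normalize each branch separately so the probe cannot just read off "Yes" vs. "No".
x_pos = (x_pos - x_pos.mean(0)) / (x_pos.std(0) + 1e-6)
x_neg = (x_neg - x_neg.mean(0)) / (x_neg.std(0) + 1e-6)

probe = torch.nn.Linear(x_pos.shape[1], 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos = torch.sigmoid(probe(x_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(x_neg)).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2             # the two answers should sum to 1
    confidence = torch.minimum(p_pos, p_neg) ** 2        # discourage the trivial p = 0.5 probe
    loss = (consistency + confidence).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The probe's score for the "Yes" branch is an unsupervised truth signal,
# identified only up to a global sign flip (true/false may be swapped).
print(torch.sigmoid(probe(x_pos)).squeeze(-1).detach())
```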