Mechanistic Understanding
Papers tagged with this research area:
Representation Engineering: A Top-Down Approach to AI Transparency
Shows that an 'honesty' concept can be located inside model activations and adjusted like a dial, challenging the notion of belief as fixed and raising questions about the nature of AI rationality.
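The "dial" here is a direction in activation space. Below is a minimal sketch of that idea, assuming access to hidden states collected from contrasting (honest vs. dishonest) prompts; the difference-of-means extraction, the names `honesty_direction`/`steer`, and the fixed coefficient `alpha` are illustrative stand-ins for the paper's fuller reading and control pipeline.

```python
import torch

def honesty_direction(honest_acts: torch.Tensor, dishonest_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a concept direction as the normalized difference of mean activations."""
    direction = honest_acts.mean(dim=0) - dishonest_acts.mean(dim=0)
    return direction / direction.norm()

def steer(hidden: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Turn the 'dial': add (or subtract) the concept direction at inference time."""
    return hidden + alpha * direction

# Toy example: random tensors stand in for real model hidden states.
d_model = 1024
honest_acts = torch.randn(64, d_model)      # activations from honest-behavior prompts
dishonest_acts = torch.randn(64, d_model)   # activations from dishonest-behavior prompts

v = honesty_direction(honest_acts, dishonest_acts)
h = torch.randn(1, d_model)                 # one hidden state during generation
h_more_honest = steer(h, v, alpha=4.0)      # push the representation toward 'honest'
h_less_honest = steer(h, v, alpha=-4.0)     # push it away
```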
Discovering Latent Knowledge in Language Models Without Supervision
Introduces Contrast-Consistent Search (CCS) to identify 'truth' as a direction in model activation space without labels, providing evidence that models hold internal beliefs distinct from what they output.
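A minimal sketch of the CCS objective, assuming precomputed activations for contrast pairs (each statement phrased as true and as false); the probe shape, tensor names, and toy training loop are illustrative rather than the paper's reference implementation.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability that the statement is true."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: the two phrasings of a statement should get complementary probabilities.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Toy training loop: random tensors stand in for real contrast-pair activations.
hidden_dim = 768
acts_pos = torch.randn(256, hidden_dim)   # activations for the "is true" phrasing
acts_neg = torch.randn(256, hidden_dim)   # activations for the "is false" phrasing

probe = CCSProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe(acts_pos), probe(acts_neg))
    loss.backward()
    opt.step()
```

The objective uses no truth labels: it only asks that the probe be logically consistent across the paired phrasings and confident, which is what lets the learned direction be read as the model's own notion of truth rather than a supervised imitation of labels.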