Interpretability

Papers with this tag:

Representation Engineering: A Top-Down Approach to AI Transparency
Shows that the ‘honesty’ concept can be located inside a model and turned up or down like a dial, challenging the notion of belief as fixed and raising questions about the nature of AI rationality.
Controllable Honesty, Representation Manipulation, Interpretability, AI Alignment, Mechanistic Understanding, Large Language Models
