AI Deception Papers

Large reasoning models exhibit something like strategic deception, including alignment faking, in-context scheming, and other behaviors that pursue goals in opposition to explicit instructions. Deception is a particularly revealing phenomenon for theories of artificial cognition because it involves apparent goal-directed behavior that is not merely independent of the prompt but actively defiant of it. That kind of opposition creates the impression of a more autonomous form of agency. Whether this impression reflects something substantive about artificial cognition, or instead arises from sophisticated forms of simulation or instruction sensitivity, remains an open question.

This site is a curated collection of key research papers on AI deception. For each paper, I raise philosophically motivated questions, draw connections to existing philosophical work, and highlight conceptual puzzles worth sustained attention. One goal is to take deflationary accounts seriously by working out their empirical content: what do they actually predict about behavior, and how might those predictions be tested? There’s also a glossary of key terms and an about page with more background.

Recent Papers