
Reading Circle Presentation: Emergent Introspective Awareness in Large Language Models

Research · Reading Circle · Mechanistic Interpretability

Sharing what I presented at the reading circle on November 13.

Presentation Summary

  • Paper: Emergent Introspective Awareness in Large Language Models
  • Author: Jack Lindsey (Anthropic)
  • Summary: The paper asks whether LLMs can be introspective, and approaches the question by manipulating and observing internal model states rather than relying only on input/output text. It runs four main kinds of experiments (e.g., injecting concept vectors into activations during inference and asking the model which concepts were or were not injected). Across multiple Claude models, Opus 4 and 4.1 showed introspective-like behavior at a higher rate than the others. The author suggests that introspective capability may be linked to post-training and overall model capability.
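To make the "concept injection" idea above concrete, here is a minimal numpy sketch of the general technique (a difference-of-means steering vector added to hidden states). This is my own toy illustration with fabricated data, not the paper's actual code; all function names and the `strength` parameter are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # toy hidden dimension

def concept_vector(acts_with, acts_without):
    """Difference-of-means 'concept vector': mean activation on
    concept-evoking prompts minus mean activation on neutral prompts."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)

def inject(hidden, vec, strength=4.0):
    """Add the normalized concept vector to every token position,
    mimicking injection into a layer's activations during inference."""
    v = vec / np.linalg.norm(vec)
    return hidden + strength * v

# Fabricated activations standing in for a real model's hidden states.
acts_with = rng.normal(loc=1.0, size=(32, d_model))     # concept-evoking prompts
acts_without = rng.normal(loc=0.0, size=(32, d_model))  # neutral prompts
vec = concept_vector(acts_with, acts_without)

hidden = rng.normal(size=(10, d_model))  # (seq_len, d_model) for a new prompt
steered = inject(hidden, vec)

# The injected states shift toward the concept direction by exactly `strength`.
v_hat = vec / np.linalg.norm(vec)
shift = (steered @ v_hat).mean() - (hidden @ v_hat).mean()
print(shift)  # → 4.0 (up to floating-point error)
```

In the paper's setting, the interesting part is then asking the *model itself* whether anything was injected; the sketch only shows the mechanical injection step.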

Slides

Reading circle slides

Closing

  • This was one of the three most interesting papers I read this year. I recommend it if the topic interests you.
  • I feel I'm getting a better sense of how to read papers with an eye toward turning them into slides. I'm gradually able to skim papers more efficiently, and I'm happy with the progress.
  • A lot of Mechanistic Interpretability work has that “wait, is that really what’s going on inside the model?” appeal, and I find it very enjoyable to read.
  • I’d like to keep sharing papers in this area in the future.