Wes Gurnee (@wesg52) / X

Wes Gurnee

@wesg52

Trying to read Claude’s mind. Interpretability at @AnthropicAI Prev: Optimizer @MIT, Byte-counter @Google

San Francisco, CA

Joined June 2022

Pinned
Wes Gurnee
@wesg52
Oct 21, 2025
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
Wes Gurnee
@wesg52
Oct 4, 2023
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales? In a new paper with @tegmark we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
Wes Gurnee
@wesg52
May 3, 2023
Neural nets are often thought of as feature extractors. But what features are neurons in LLMs actually extracting? In our new paper, we leverage sparse probing to find out: arxiv.org/abs/2305.01610. A 🧵:
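Editor's sketch: one simple way to pick the neurons for a k-sparse probe is to rank each neuron by how far apart its mean activation is on examples with and without a feature, then fit a probe on only the top k. The snippet below shows just that selection step on synthetic data. The activations, the planted neuron index, and all function names are illustrative assumptions, not taken from the paper.

```python
import random

def mean_difference_scores(activations, labels):
    """Score each neuron by |mean activation on positives - mean on negatives|."""
    n_neurons = len(activations[0])
    pos = [a for a, y in zip(activations, labels) if y == 1]
    neg = [a for a, y in zip(activations, labels) if y == 0]
    scores = []
    for j in range(n_neurons):
        mu_pos = sum(a[j] for a in pos) / len(pos)
        mu_neg = sum(a[j] for a in neg) / len(neg)
        scores.append(abs(mu_pos - mu_neg))
    return scores

def select_top_k(scores, k):
    """Return the indices of the k highest-scoring neurons."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

# Synthetic data (assumption): 20 noisy neurons, one of which tracks the label.
rng = random.Random(0)
N_NEURONS, FEATURE_NEURON = 20, 7
labels = [i % 2 for i in range(200)]
activations = []
for y in labels:
    row = [rng.gauss(0.0, 1.0) for _ in range(N_NEURONS)]
    row[FEATURE_NEURON] += 3.0 * y  # planted signal
    activations.append(row)

scores = mean_difference_scores(activations, labels)
top_neurons = select_top_k(scores, k=3)
print("neurons selected for the sparse probe:", top_neurons)
```

A probe restricted to the selected neurons would then be trained as usual; shrinking k shows how concentrated the feature is.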
Wes Gurnee
@wesg52
Oct 4, 2023
Replying to @wesg52
For spatial representations, we run Llama-2 models on the names of tens of thousands of cities, structures, and natural landmarks around the world, the USA, and NYC. We then train linear probes on the last-token activations to predict the real latitude and longitude of each place.
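Editor's sketch: a linear probe of this kind is a regularized linear regression from an activation vector to (latitude, longitude). The snippet below fits one in closed form using only the standard library. The activations are synthetic stand-ins (a random linear encoding of normalized coordinates plus noise), and the hidden size, noise level, ridge strength, and function names are all assumptions for illustration. In practice NumPy or scikit-learn would do the fitting.

```python
import random

def fit_linear_probe(X, Y, ridge=1e-3):
    """Ridge regression via the normal equations, solved by Gauss-Jordan elimination.

    A constant 1.0 is appended to each input so the last weight row acts as a bias
    (the small ridge penalty is applied to it as well, for simplicity).
    """
    Xb = [row + [1.0] for row in X]
    d, k = len(Xb[0]), len(Y[0])
    # Build the augmented matrix [X^T X + ridge*I | X^T Y].
    M = []
    for i in range(d):
        a_row = [sum(r[i] * r[j] for r in Xb) + (ridge if i == j else 0.0) for j in range(d)]
        b_row = [sum(r[i] * y[c] for r, y in zip(Xb, Y)) for c in range(k)]
        M.append(a_row + b_row)
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))  # partial pivoting
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(d):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[d:] for row in M]  # shape (d, k)

def predict(W, x):
    xb = x + [1.0]
    return [sum(xb[i] * W[i][c] for i in range(len(xb))) for c in range(len(W[0]))]

# Synthetic "activations" (assumption): coordinates in [-1, 1] encoded linearly, plus noise.
rng = random.Random(0)
HIDDEN = 8
u = [rng.gauss(0, 1) for _ in range(HIDDEN)]
v = [rng.gauss(0, 1) for _ in range(HIDDEN)]

def fake_activation(lat, lon):
    return [lat * a + lon * b + rng.gauss(0, 0.01) for a, b in zip(u, v)]

coords = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(300)]
acts = [fake_activation(lat, lon) for lat, lon in coords]
train_X, test_X = acts[:200], acts[200:]
train_Y, test_Y = coords[:200], coords[200:]

W = fit_linear_probe(train_X, train_Y)
errors = [abs(p - t) for x, y in zip(test_X, test_Y) for p, t in zip(predict(W, x), y)]
mean_abs_error = sum(errors) / len(errors)
print("held-out mean absolute error (normalized units):", mean_abs_error)
```

Evaluating on held-out places, as done here, is what separates a representation the probe reads out from one it merely memorizes.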
Wes Gurnee
@wesg52
Jan 23, 2024
New paper! "Universal Neurons in GPT2 Language Models". How many neurons are independently meaningful? How many neurons reappear across models with different random inits? Do these neurons specialize into specific functional roles or form feature families? Answers below 🧵:
Wes Gurnee
@wesg52
Oct 4, 2023
Replying to @wesg52
To see all the details and additional validations, check out the paper: arxiv.org/abs/2310.02207. Code and datasets: github.com/wesg52/world-m…
Wes Gurnee
@wesg52
Oct 4, 2023
Replying to @wesg52
For temporal representations, we run the models on the names of famous figures from the past 3000 years, the names of songs, movies, and books from 1950 onward, and NYT headlines from the 2010s, and train linear probes to predict the year of death, release date, and publication date.
Wes Gurnee
@wesg52
Oct 4, 2023
Replying to @wesg52
But does the model actually _use_ these representations? By looking for neurons with weights similar to the probe's, we find many space and time neurons that are sensitive to the spacetime coords of an entity, showing that the model itself, not the probe, learned the global geometry.
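Editor's sketch: one way to look for neurons whose weights resemble a probe is to compute the cosine similarity between the probe direction and each neuron's input weight vector, then inspect the top matches. The weights below are random stand-ins with one aligned neuron planted on purpose. The dimensions, the planted index, and the function names are assumptions, not the paper's code.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_aligned_neurons(probe_dir, neuron_weights, k=3):
    """Return (similarity, neuron index) pairs for the k neurons most aligned with the probe."""
    scored = [(cosine(probe_dir, w), idx) for idx, w in enumerate(neuron_weights)]
    return sorted(scored, reverse=True)[:k]

# Synthetic weights (assumption): 50 random neurons in a 64-dim residual stream.
rng = random.Random(0)
D_MODEL, N_NEURONS, PLANTED = 64, 50, 17
probe_dir = [rng.gauss(0, 1) for _ in range(D_MODEL)]
neuron_weights = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_NEURONS)]
# Overwrite one neuron so it reads (almost) the same direction as the probe.
neuron_weights[PLANTED] = [2.0 * p + rng.gauss(0, 0.1) for p in probe_dir]

top = top_aligned_neurons(probe_dir, neuron_weights)
for sim, idx in top:
    print(f"neuron {idx}: cosine similarity {sim:.3f}")
```

High similarity only nominates candidates; checking that a candidate's activations actually vary with an entity's coordinates is a separate step.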
Wes Gurnee
@wesg52
Oct 4, 2023
Replying to @wesg52
When training probes over every layer and model, we find that representations emerge gradually over the early layers before plateauing at around the halfway point. As expected, bigger models are better, but for more obscure datasets (NYC) no model is great.
Wes Gurnee
@wesg52
Oct 4, 2023
Replying to @wesg52
Are these representations actually linear? By comparing the performance of nonlinear MLP probes with linear probes, we find evidence that they are! More complicated probes do not perform any better on the test set.
Wes Gurnee
@wesg52
Oct 21, 2025
Replying to @wesg52
The task is simply when to break a line in fixed width text. This requires the model to in-context learn the line width constraint, state track the characters in the current line, compute the characters remaining, and determine if the next word fits!
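Editor's sketch: written out as ordinary code, the computation described above is short. The snippet tracks the characters in the current line, computes how many remain, and decides whether the next word fits. The `infer_line_width` helper is a crude assumption (it takes the longest earlier line as the width) standing in for the in-context inference step. None of this is meant to describe how the model implements the task internally.

```python
def infer_line_width(previous_lines):
    """Crude estimate of the width constraint: the longest line seen so far (an assumption)."""
    return max(len(line) for line in previous_lines)

def should_break(line_width, chars_in_line, next_word):
    """True if appending next_word (plus a separating space) would exceed line_width."""
    remaining = line_width - chars_in_line
    needed = len(next_word) + (1 if chars_in_line > 0 else 0)
    return needed > remaining

def wrap(text, line_width):
    """Greedy fixed-width wrapping built on should_break."""
    lines, current = [], ""
    for word in text.split():
        if current and should_break(line_width, len(current), word):
            lines.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        lines.append(current)
    return lines

wrapped = wrap("the quick brown fox jumps", 10)
print(wrapped)
```

A word longer than the line width is placed on its own line rather than split, which is one of several reasonable conventions.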
Wes Gurnee
@wesg52
Mar 27, 2025
We tried to build a “microscope” to understand how Claude works. There are still many things which we cannot see clearly, but there are many exciting things that are coming into focus! A few reflections and exciting results:
Anthropic
@AnthropicAI
Mar 27, 2025
New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
Wes Gurnee
@wesg52
Oct 4, 2023
Replying to @wesg52
Are these representations robust to prompting? Probing on different prompts, we find performance is largely preserved but can be degraded by capitalizing the entity name or prepending random tokens. Also, probing on the trailing period instead of the last token is better for headlines.
Wes Gurnee
@wesg52
Oct 4, 2023
Replying to @wesg52
Finally, special shoutout to @NeelNanda5 for all the feedback on the paper and project!