Wes Gurnee
225 posts
Wes Gurnee
@wesg52
Trying to read Claude’s mind. Interpretability at @AnthropicAI Prev: Optimizer @MIT, Byte-counter @Google
San Francisco, CA
wesg.me
Joined June 2022
235 Following
4,503 Followers
  • Pinned
    Wes Gurnee
    @wesg52
    Oct 21, 2025
    New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
    461K
  • Wes Gurnee
    @wesg52
    Oct 4, 2023
    Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales? In a new paper with @tegmark we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
    3.3M
  • Wes Gurnee
    @wesg52
    May 3, 2023
    Neural nets are often thought of as feature extractors. But what features are neurons in LLMs actually extracting? In our new paper, we leverage sparse probing to find out: arxiv.org/abs/2305.01610. A 🧵:
    228K
  • Wes Gurnee
    @wesg52
    Oct 4, 2023
    Replying to @wesg52
    For spatial representations, we run Llama-2 models on the names of tens of thousands of cities, structures, and natural landmarks around the world, the USA, and NYC. We then train linear probes on the last-token activations to predict the real latitude and longitude of each place.
    139K
  • Wes Gurnee
    @wesg52
    Jan 23, 2024
    New paper! "Universal Neurons in GPT2 Language Models". How many neurons are independently meaningful? How many neurons reappear across models with different random inits? Do these neurons specialize into specific functional roles or form feature families? Answers below 🧵:
    76K
  • Wes Gurnee
    @wesg52
    Oct 4, 2023
    Replying to @wesg52
    To see all the details and additional validations, check out the paper: arxiv.org/abs/2310.02207. Code and datasets: github.com/wesg52/world-m…
    38K
  • Wes Gurnee
    @wesg52
    Oct 4, 2023
    Replying to @wesg52
    For temporal representations, we run the models on the names of famous figures from the past 3000 years, the names of songs, movies, and books from 1950 onward, and NYT headlines from the 2010s, and train linear probes to predict the year of death, release date, and publication date, respectively.
    54K
  • Wes Gurnee
    @wesg52
    Oct 4, 2023
    Replying to @wesg52
    But does the model actually _use_ these representations? By looking for neurons whose weights are similar to the probe's, we find many space and time neurons that are sensitive to the spacetime coords of an entity, showing that the model, not the probe, actually learned the global geometry.
    38K
  • Wes Gurnee
    @wesg52
    Oct 4, 2023
    Replying to @wesg52
    When training probes over every layer and model, we find that representations emerge gradually over the early layers before plateauing at around the halfway point. As expected, bigger models are better, but for more obscure datasets (NYC), no model is great.
    48K
  • Wes Gurnee
    @wesg52
    Oct 4, 2023
    Replying to @wesg52
    Are these representations actually linear? By comparing the performance of nonlinear MLP probes with linear probes, we find evidence that they are! More complicated probes do not perform any better on the test set.
    51K
  • Wes Gurnee
    @wesg52
    Oct 21, 2025
    Replying to @wesg52
    The task is simply deciding when to break a line in fixed-width text. This requires the model to in-context learn the line width constraint, state track the characters in the current line, compute the characters remaining, and determine if the next word fits!
    20K
  • Wes Gurnee
    @wesg52
    Mar 27, 2025
    We tried to build a “microscope” to understand how Claude works. There are still many things which we cannot see clearly, but there are many exciting things that are coming into focus! A few reflections and exciting results:
    Anthropic
    @AnthropicAI
    Mar 27, 2025
    New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
    9.2K
  • Wes Gurnee
    @wesg52
    Oct 4, 2023
    Replying to @wesg52
    Are these representations robust to prompting? Probing on different prompts, we find performance is largely preserved but can be degraded by capitalizing the entity name or prepending random tokens. Also, probing on the trailing period instead of the last token is better for headlines.
    41K
  • Wes Gurnee
    @wesg52
    Oct 4, 2023
    Replying to @wesg52
    Finally, special shoutout to @NeelNanda5 for all the feedback on the paper and project!
    32K
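
The Oct 4, 2023 thread above describes training linear probes on last-token activations to predict each place's latitude and longitude. The following is a minimal sketch of that idea, not the paper's code: it assumes synthetic "activations" in place of real Llama-2 activations, uses closed-form ridge regression as the probe, and every name in it (`fit_linear_probe`, `make_synthetic_data`, the `l2` and `noise` parameters) is hypothetical.

```python
import numpy as np


def fit_linear_probe(acts, targets, l2=1.0):
    """Fit a ridge-regression probe so that targets ~= acts @ W + b."""
    mu_x = acts.mean(axis=0)
    mu_y = targets.mean(axis=0)
    xc = acts - mu_x          # center inputs so the bias is not penalized
    yc = targets - mu_y
    d = xc.shape[1]
    # Closed-form ridge solution: (X^T X + l2 * I) W = X^T Y
    W = np.linalg.solve(xc.T @ xc + l2 * np.eye(d), xc.T @ yc)
    b = mu_y - mu_x @ W
    return W, b


def r2_score(y_true, y_pred):
    """Coefficient of determination pooled over all target columns."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot


def make_synthetic_data(n=2000, d=64, noise=0.1, seed=0):
    """Assumption: stand-in activations that encode (lat, lon) linearly plus noise."""
    rng = np.random.default_rng(seed)
    coords = np.column_stack(
        [rng.uniform(-90, 90, n), rng.uniform(-180, 180, n)]
    )
    embed = rng.normal(size=(2, d))  # random linear embedding of the coordinates
    acts = coords @ embed / 100.0 + noise * rng.normal(size=(n, d))
    return acts, coords


acts, coords = make_synthetic_data()
split = 1500  # train on the first 1500 places, evaluate on the rest
W, b = fit_linear_probe(acts[:split], coords[:split])
test_r2 = r2_score(coords[split:], acts[split:] @ W + b)
print(f"held-out R^2: {test_r2:.3f}")
```

With real model activations, `acts` would be replaced by the hidden states at a chosen layer and token position. Evaluating on held-out places, as here, is what separates a recovered representation from a probe that has merely memorized its training set.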
