LLM interpretability

28 May 2024

Interpreting LLMs by Anthropic

Last week, Anthropic released a massive paper (press and full paper) on interpreting LLMs. They used dictionary learning to extract millions of features from the middle layer of one of their LLMs (Claude 3.0 Sonnet), providing a rough conceptual map of the model's internal states halfway through its computation. The features were quite abstract, multimodal, and multilingual, representing concepts such as cities (San Francisco), people (Rosalind Franklin), atomic elements (Lithium), scientific fields (immunology), and programming syntax (function calls).
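To make "dictionary learning" a bit more concrete, here is a minimal sketch of the kind of sparse autoencoder used for this purpose. It is only an illustration under my own assumptions: the sizes, the random untrained weights, and the names `encode`, `decode`, and `sae_loss` are hypothetical, not Anthropic's code. The idea is that each activation vector gets rewritten as a sparse, non-negative combination of learned "feature" directions.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 16      # hypothetical activation width (real models are far wider)
N_FEATURES = 64   # hypothetical dictionary size (the paper reports millions)

# Randomly initialised, untrained parameters, for illustration only.
W_enc = rng.normal(scale=0.1, size=(N_FEATURES, D_MODEL))
b_enc = np.zeros(N_FEATURES)
W_dec = rng.normal(scale=0.1, size=(D_MODEL, N_FEATURES))
b_dec = np.zeros(D_MODEL)

def encode(x):
    """Map an activation vector to non-negative feature activations."""
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(f):
    """Reconstruct the activation as a weighted sum of dictionary directions."""
    return W_dec @ f + b_dec

def sae_loss(x, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1-style sparsity penalty."""
    f = encode(x)
    x_hat = decode(f)
    reconstruction = np.sum((x - x_hat) ** 2)
    # Weight each feature by the norm of its decoder direction.
    sparsity = np.sum(f * np.linalg.norm(W_dec, axis=0))
    return reconstruction + l1_coeff * sparsity

x = rng.normal(size=D_MODEL)   # stand-in for one middle-layer activation
features = encode(x)
print(features.shape, float(sae_loss(x)))
```

Training would minimize `sae_loss` over a large set of recorded activations. The sparsity term is what pushes each feature to fire on only a narrow set of inputs, which is what makes the features readable as concepts.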

After that, they started to manipulate these features using a method they call feature steering, where they “clamp” specific features of interest to artificially high or low values. In one of the most striking examples, they “clamped” the feature “Golden Gate Bridge” and the resulting behavior was fascinating (the model became obsessed with the bridge, bringing it up in every output no matter the context, and even started “believing” that the bridge was its physical form).
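Clamping can be sketched in a few lines. This is a toy illustration under my own assumptions, not the paper's implementation: the weights are random, the feature index `GOLDEN_GATE` is made up, and `steer` is a hypothetical helper. The activation is encoded into features, one feature is overwritten with a fixed value, and the result is decoded back, keeping whatever part of the activation the dictionary could not explain.

```python
import numpy as np

rng = np.random.default_rng(1)
D_MODEL, N_FEATURES = 16, 64   # hypothetical toy sizes

# Untrained stand-ins for a learned encoder and decoder.
W_enc = rng.normal(scale=0.1, size=(N_FEATURES, D_MODEL))
W_dec = rng.normal(scale=0.1, size=(D_MODEL, N_FEATURES))

def steer(x, feature_idx, clamp_value):
    """Return the activation with one feature forced to clamp_value."""
    f = np.maximum(0.0, W_enc @ x)   # feature activations
    error = x - W_dec @ f            # residual the dictionary fails to reconstruct
    f_clamped = f.copy()
    f_clamped[feature_idx] = clamp_value
    # Decode the edited features and add the residual back.
    return W_dec @ f_clamped + error

x = rng.normal(size=D_MODEL)   # stand-in for one middle-layer activation
GOLDEN_GATE = 7                # hypothetical feature index
steered = steer(x, GOLDEN_GATE, clamp_value=10.0)

# The edit moves the activation only along that feature's decoder direction.
original_value = np.maximum(0.0, W_enc @ x)[GOLDEN_GATE]
shift = W_dec[:, GOLDEN_GATE] * (10.0 - original_value)
print(np.allclose(steered - x, shift))
```

In a real model the steered activation would replace the original one at that layer on every forward pass, so everything computed downstream sees the inflated feature.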

While they claim in the paper that this is super important for guaranteeing the safety of these models, we can think of a million ways this technique could be exploited by bad actors. Imagine a model that has been feature-steered to produce outputs more favorable to a given politician. Scary. It is important for us to always keep in mind the danger of blindly relying on these models. They are biased, manipulable, and designed by big companies with varying interests.

Anthropic released Golden Gate Claude last week for a few days, so I had the chance to play with the chat model. I curated some examples of my own and some that I found on X, and asked Claude 3 Opus (their best model) to predict a diagnosis based on Golden Gate Claude's responses. I framed the examples as an interview between a neurologist and a patient with dementia. See the examples here and the model’s diagnostic report here. Below is a preview:

Main Cognitive Symptoms:

Possible Clinical and Pathological Diagnoses: