Submitted by visarga t3_z7rabn in MachineLearning
> If we take the SVD of the weight matrices of the OV circuit and of MLP layers of GPT models, and project them to token embedding space, we notice this results in highly interpretable semantic clusters. This means that the network learns to align the principal directions of each MLP weight matrix or attention head to read from or write to semantically interpretable directions in the residual stream.
> We can use this to both improve our understanding of transformer language models and edit their representations. We use this finding to design a natural language query locator, where you can write a set of natural language concepts and find all weight directions in the network that correspond to them, and also to edit the network's representations by deleting specific singular vectors, which has relatively large effects on the logits related to the semantics of that vector and relatively small effects on semantically different clusters.
Looks like a thoughtful article with nice visuals.
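To get a feel for the core idea, here is a minimal sketch (not the paper's actual code) of the procedure for one GPT-2 MLP layer: take the SVD of the MLP output projection and read each singular direction off through the unembedding matrix as its nearest tokens. The layer index and the number of singular vectors/tokens shown are arbitrary choices for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

layer = 5                                                       # illustrative layer choice
W_out = model.transformer.h[layer].mlp.c_proj.weight.detach()   # (d_mlp, d_model)
W_U = model.lm_head.weight.detach()                             # (vocab, d_model) unembedding

# SVD of the MLP output projection: rows of Vh are the directions in the
# residual stream that this MLP layer writes to.
U, S, Vh = torch.linalg.svd(W_out, full_matrices=False)

# Project each singular direction into token space and list its top tokens.
token_scores = Vh @ W_U.T                                       # (d_model, vocab)
for i in range(5):                                              # first few singular vectors
    top = torch.topk(token_scores[i], k=10).indices
    print(f"singular vector {i}:", [tok.decode([t]) for t in top.tolist()])
```

If the paper's claim holds, the tokens printed for a given singular vector should cluster around a recognizable theme, which is what makes the directions editable.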
beezlebub33 t1_iy8b3ht wrote
This is very interesting, though it's somewhat dense and hard to follow if you don't have some of the background.
I recommend reading an article they reference: A Mathematical Framework for Transformer Circuits https://transformer-circuits.pub/2021/framework/index.html
If nothing else, that paper will explain that OV means output-value:
> Attention heads can be understood as having two largely independent computations: a QK ("query-key") circuit which computes the attention pattern, and an OV ("output-value") circuit which computes how each token affects the output if attended to.
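For concreteness, here is a rough sketch (under my own assumptions about HuggingFace's fused GPT-2 attention weights, not code from either paper) of that factorization: slice out one head's W_Q, W_K, W_V, and W_O, then form the low-rank QK and OV matrices. The layer and head indices are just example values.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
layer, head = 0, 0                              # example layer/head
attn = model.transformer.h[layer].attn
d_model = model.config.n_embd                   # 768
d_head = d_model // model.config.n_head         # 64

# c_attn.weight is (d_model, 3*d_model): columns are [Q | K | V].
W = attn.c_attn.weight.detach()
cols = slice(head * d_head, (head + 1) * d_head)
W_Q = W[:, :d_model][:, cols]                   # (d_model, d_head)
W_K = W[:, d_model:2 * d_model][:, cols]        # (d_model, d_head)
W_V = W[:, 2 * d_model:][:, cols]               # (d_model, d_head)
W_O = attn.c_proj.weight.detach()[cols, :]      # (d_head, d_model)

# QK circuit: bilinear form scoring which residual-stream directions attend to which.
W_QK = W_Q @ W_K.T                              # (d_model, d_model)
# OV circuit: low-rank map describing how an attended-to token moves the output.
W_OV = W_V @ W_O                                # (d_model, d_model), rank <= d_head
print(W_QK.shape, W_OV.shape, torch.linalg.matrix_rank(W_OV))
```

The SVD analysis in the submitted paper is applied to matrices like W_OV, which is why its singular vectors live in the residual stream and can be projected to token space the same way as the MLP weights above.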