Recently, I came across a Dwarkesh podcast episode where it was suggested to try to identify the specializations of Mixtral's experts with SAEs (Sparse Autoencoders, read link for background). This got me curious, so I spent some time looking into it. tl;dr: there really are clear specializations, and SAE-generated features shed light on them. The experts' specializations are very niche and varied, however.
Quick Overview of My Approach: Correlating SAE-Feature Activations to Expert Weights
First, I processed tokens in Mixtral (using a Colab A100-40GB GPU, with Wikipedia as the dataset) to generate two vectors per token: the input vector (the activations going into the experts) and the expert-weight vector (the gating weights used to weigh the experts' outputs). I used the input vectors to train an SAE, and then used it to create a feature vector for every expert-input vector (x8 expansion rate).
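To make the two per-token vectors concrete, here is a minimal toy sketch in NumPy. It is not the code I ran: the function names, the toy sizes and the random matrices are all hypothetical, the gating assumes the usual top-2 scheme (softmax over the 8 experts, keep the top 2, renormalize), and the SAE is assumed to be a plain ReLU encoder whose width is 8 times the input width.

```python
import numpy as np

def expert_weights(router_logits, top_k=2):
    """Turn router logits (n_tokens, n_experts) into gating weights.

    Assumption: softmax over all experts, keep the top_k, renormalize them
    to sum to 1, and leave the remaining experts at 0.
    """
    z = router_logits - router_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    top = np.argsort(probs, axis=1)[:, -top_k:]
    weights = np.zeros_like(probs)
    rows = np.arange(probs.shape[0])[:, None]
    weights[rows, top] = probs[rows, top]
    return weights / weights.sum(axis=1, keepdims=True)

def sae_features(x, W_enc, b_enc):
    """ReLU encoder of a plain sparse autoencoder (assumed architecture)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

# Toy stand-ins for the real activations and trained parameters.
rng = np.random.default_rng(0)
d_model, n_experts, expansion = 16, 8, 8      # expansion mirrors the x8 rate
x = rng.normal(size=(5, d_model))             # expert-input vectors
W_gate = rng.normal(size=(d_model, n_experts))
W_enc = rng.normal(size=(d_model, expansion * d_model)) * 0.1
b_enc = np.zeros(expansion * d_model)

weights = expert_weights(x @ W_gate)          # one 8-dim vector per token
features = sae_features(x, W_enc, b_enc)      # one 128-dim vector per token
print(weights.shape, features.shape)
```

In the real setting `x` would be the recorded layer activations and `W_enc`, `b_enc` the trained SAE parameters; only the shapes and the data flow carry over from this toy version.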
The reason for recording expert weights is to later correlate the activations of the features the SAE finds with the weights of each of the 8 experts (note: that's just one approach; another could be to analyze how features contribute to the expert weights directly). This helps to understand the experts' specializations - if a feature highly correlates with the weight of an expert, that expert is likely specialized in what the feature activates for. Expert weights can also serve as a signal of meaning: by correlating our features with them, we can try to isolate features that have substantive meaning themselves. That's useful given how noisy the features are, even if we weren't trying to understand expert specialization.
The correlations helped to identify the most relevant features. I inspected the 25 features most correlated with each of the eight experts' weights (I'll provide the entire feature list as a .txt file below). From this list of features I derived the specializations of the experts, and then validated some of these specializations.
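The correlation step can be sketched as follows. This is an illustration under assumptions rather than my exact code: I assume a plain Pearson correlation between each feature's activations and each expert's weight across tokens, and the function names and toy data are hypothetical.

```python
import numpy as np

def feature_expert_correlations(features, weights):
    """Pearson correlation of every feature column with every expert column.

    features: (n_tokens, n_features) SAE activations
    weights:  (n_tokens, n_experts) gating weights
    returns:  (n_features, n_experts)
    """
    f = features - features.mean(axis=0)
    w = weights - weights.mean(axis=0)
    cov = f.T @ w / features.shape[0]
    denom = np.outer(f.std(axis=0), w.std(axis=0))
    # Features that never fire have zero variance; report 0 instead of NaN.
    return np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

def top_features_per_expert(corr, k=25):
    """Indices of the k most positively correlated features for each expert."""
    order = np.argsort(-corr, axis=0)   # descending within each expert column
    return order[:k].T                  # (n_experts, k)

# Toy data: feature 3 is built to track expert 7, feature 5 never fires.
rng = np.random.default_rng(0)
weights = rng.random((1000, 8))
features = rng.random((1000, 32))
features[:, 3] = 2.0 * weights[:, 7]
features[:, 5] = 0.0

corr = feature_expert_correlations(features, weights)
top = top_features_per_expert(corr, k=25)
print(top[7][:3])   # feature 3 should come first for expert 7
```

Sorting by signed correlation keeps only positively correlated features at the top; sorting by absolute value instead would also surface strongly anti-correlated ones.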
Before listing specializations, let's go over an example.
Example 1: Date-year specializations
Years-in-dates specializations are common across all of Mixtral's experts (in layer 22, which I inspected), and for each different niche (20th century, 21st century, etc.), Mixtral seems quite opinionated about which of the 8 experts to send the niche to - a different expert every time. By "opinionated" I mean that the feature has a high correlation with a specific expert and not with the others.
Below I'll have a look at some date-year related features. For each feature, I list its correlations with each of the 8 experts and its top-10 most activating tokens. For each of these tokens, I print its preceding sequence and the top 3 logits Mixtral predicts for that token.
Feature #15649 activates when the model predicts the year digit in the first decade of the 21st century (i.e. '200x'). Feature #14066 is similar: it activates on the token '0', but when predicting the decade in the 21st century (i.e. '20x'). These look similar on the face of it, and if "dates" or "year-dates" were a specialty of an expert, you'd generally expect to see these features corresponding to the same expert. Instead, feature #15649 is highly correlated with Expert 0, and #14066 is highly correlated with Expert 2 (aka E2), with a slight negative correlation with Expert 0.
Another interesting feature is #22371. To me it seems identical to #14066, but to the model the difference is large enough that it predominantly assigns the corresponding tokens to expert 3 instead of expert 2 (which is the expert for feature #14066).
#21272 and #23644 show similar behavior to #15649 and #14066, just for the late 20th century. Feature #21272 corresponds to the decade in the 20th century ("19x"), and is very highly correlated with expert 7 (18%). In contrast, feature #23644, which matches the year digit in the 90s ("199x"), is anti-correlated (-3%) with expert 7, and instead correlates with experts 1 and 2.
Year-date is just a single example of a type of feature. But I think it demonstrates how niche the categories experts specialize in are, and how within these categories you can observe a strong expert specialty that manifests itself in high correlation between expert-weights and feature-activations.
Example 2: 'The'
Just to give another short example, below you can see a list of all the features, out of the 200 I inspected, that activate for the token 'the'.
3 out of 4 of the features processing the word 'the' are highly correlated with Expert 0 and show no correlation with expert 1. However, #26232, which focuses on 'the' for straightforward geographical locations, is highly correlated with expert 1 and is fairly neutral towards expert 0.
Sanity check
For any feature we find, we don't know whether there is another feature that we would describe the same way but that correlates with a different expert. This can be seen with features #22371 and #14066 above - had I seen only one of them, I'd probably infer that predicting decades in the 21st century is what the feature is about. Little would I know that there is another feature just like it that corresponds to a different expert. There are also other uncertainties, such as whether a 15% feature-expert correlation indicates that an expert really does specialize in that feature, and whether we can extrapolate a feature's specialization from its top 10 most activating tokens.
To look into that, I returned to feature #21272, which fires on decade predictions in the 20th century ('19x').
I chose feature #21272 because it shows a high correlation with E7 (18%) and seems relatively easy to validate. I consider it easy to validate because selecting all sequences that end with ' 19' is a good enough approximation for when the feature fires. That's in contrast to, say, features that activate on geographical locations. To validate that E7 indeed specializes in predicting the decade of the 20th century, we'd want to see E7 highly active for these sequences.
So I ran Mixtral on all the sequences that end in ' 19' in Simple Wikipedia (215,745 of them), and confirmed that E7 is indeed activated (i.e. it is one of the top 2 of the 8 experts) on 96% of these sequences (at the '9' token at the end of each sequence), which is a great sign.
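The "activated" check above boils down to asking, for each sequence, whether the expert is among the two largest gating weights at the final token. A minimal sketch, with a hypothetical function name and made-up toy weights (3 experts instead of 8, to keep it readable):

```python
import numpy as np

def top2_rate(weights, expert):
    """Fraction of rows where `expert` is among the 2 largest gating weights.

    weights: (n_sequences, n_experts), one row per sequence, taken at the
    token of interest (here that would be the final '9').
    """
    top2 = np.argsort(weights, axis=1)[:, -2:]
    return float((top2 == expert).any(axis=1).mean())

# Toy gating weights for 4 sequences and 3 experts.
toy = np.array([
    [0.10, 0.60, 0.30],
    [0.50, 0.40, 0.10],
    [0.20, 0.10, 0.70],
    [0.05, 0.15, 0.80],
])
rate_e2 = top2_rate(toy, expert=2)
rate_e0 = top2_rate(toy, expert=0)
print(rate_e2, rate_e0)
```

With 8 experts and top-2 routing, an expert chosen uniformly at random would show up in roughly a quarter of the rows, which is a useful baseline when reading a number like 96%.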
I also wanted to see how many of the 4% that didn't activate E7 were simply sequences that shouldn't have been selected in the first place - that is, ' 19' strings that aren't part of a year in a date, but maybe an address or something. I found that for 27% of the 4% (8,629 sequences) that didn't activate E7, the top next-token prediction was a non-digit character, indicating the ' 19' substring wasn't part of a year in a date. This is in contrast to only 2% non-digit top next-token predictions for the sequences that did activate E7.
Finally, I did a similar sanity check for each of the year-date features I discussed above. The graph below shows, for every feature, what percent of the tokens in the matching date substrings (e.g. ' 20' for 21st century decades, or ' 199' for the 90s) activate the expert we'd predict from the feature-activation to expert-weight correlations.
It seems the activations are near 100% for 3 of the 4 specializations, indicating that the experts indeed specialize in the date-years we described above, and that it's not the case that there is another similar feature we don't know about that correlates with different experts. So overall, E1 does specialize in the digits of the 90s, E7 in the decades of the 20th century, and E0 in the year digits of the first decade after 2000.
The ' 20' substring, which did not score as high as the others in the graph, is in line with what I discussed earlier: it seems this task (decade prediction in the 21st century) is split between 2 features (#14066 and #22371), each assigned to a different expert (E2 and E3 respectively), and the SAE managed to identify both features (kudos to the SAE!). I couldn't quite tell how the model selects which tokens to send to E2 and which to E3, but, judging by the strong correlation differences between E2 and E3 in the features, the model seems again to be quite opinionated about that, haha.
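The non-digit comparison can be sketched like this. The names and the toy inputs are hypothetical, and it assumes that for every sequence we already have a flag saying whether the expert was in the top 2 and the decoded string of the model's top next-token prediction.

```python
def non_digit_share(top_tokens):
    """Fraction of predicted tokens that are not made up of digits."""
    if not top_tokens:
        return 0.0
    return sum(not t.strip().isdigit() for t in top_tokens) / len(top_tokens)

def split_by_activation(activated, top_tokens):
    """Non-digit share among sequences that did / did not activate the expert."""
    hit = [t for a, t in zip(activated, top_tokens) if a]
    miss = [t for a, t in zip(activated, top_tokens) if not a]
    return non_digit_share(hit), non_digit_share(miss)

# Toy inputs: 5 sequences ending in ' 19' and the top predicted next token.
activated = [True, True, True, False, False]
top_tokens = ["8", "4", "th", ",", "9"]
hit_share, miss_share = split_by_activation(activated, top_tokens)
print(hit_share, miss_share)
```

A much higher non-digit share in the "miss" group than in the "hit" group is what suggests the misses are mostly strings that were never years to begin with.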
So, what do experts specialize in?
Below is a randomly selected sample of specializations for some experts:
For reference, I include below the entire list of the 25 most correlated features for each of the 8 experts.
Thanks for reading