
Superposition is not "just" neuron polysemanticity

Published on April 26, 2024 11:22 PM GMT

TL;DR: In this post, I distinguish between two related concepts in neural network interpretability: polysemanticity and superposition. Neuron polysemanticity is the observed phenomenon that many neurons seem to fire (have large, positive activations) on multiple unrelated concepts. Superposition is a specific explanation for neuron (or attention head) polysemanticity: the hypothesis that a neural network represents more sparse features than it has neurons (or attention heads/head dimensions) by placing them in near-orthogonal directions. I provide three ways neurons/attention heads can be polysemantic without superposition: non-neuron-aligned orthogonal features, non-linear feature representations, and compositional representation without features. I conclude by listing a few reasons why it might be important to distinguish the two concepts.

Epistemic status: I wrote this “quickly” in about 12 hours, as otherwise it wouldn’t have come out at all. Think of it as a (failed) experiment in writing brief and unpolished research notes, along the lines of GDM or Anthropic Interp Updates.

 

Introduction

Meaningfully interpreting neural networks involves decomposing them into smaller interpretable components. For example, we might hope to look at each neuron or attention head, explain what that component is doing, and then compose our understanding of individual components into a mechanistic understanding of the model’s behavior as a whole.

It would be very convenient if the natural subunits of neural networks – neurons and attention heads – were monosemantic – that is, if each component corresponded to “a single concept”. Unfortunately, by default, both neurons and attention heads seem to be polysemantic: many of them seemingly correspond to multiple unrelated concepts. For example, out of the 307k neurons in GPT-2, GPT-4 was able to generate short explanations that captured over 50% of the variance for only 5,203 neurons, and a quick glance at OpenAI Microscope reveals many examples of neurons in vision models that fire on unrelated clusters such as “poetry” and “dice”.

One explanation for polysemanticity is the superposition hypothesis: polysemanticity occurs because models (approximately) linearly represent more features[1] than their activation space has dimensions (i.e. they place features in superposition). Since there are more features than neurons, it immediately follows that some neurons must correspond to more than one feature.[2] 

It’s worth noting that most written resources on superposition clearly distinguish between the two terms. For example, in the seminal Toy Model of Superposition,[3] Elhage et al write:

Why are we interested in toy models? We believe they are useful proxies for studying the superposition we suspect might exist in real neural networks. But how can we know if they're actually a useful toy model? Our best validation is whether their predictions are consistent with empirical observations regarding polysemanticity.

(Source)

Similarly, Neel Nanda’s mech interp glossary explicitly notes that the two concepts are distinct:

Subtlety: Neuron superposition implies polysemanticity (since there are more features than neurons), but not the other way round. There could be an interpretable basis of features, just not the standard basis - this creates polysemanticity but not superposition.

(Source)

However, I’ve noticed empirically that many researchers and grantmakers conflate the two concepts, which often causes communication issues or even confused research proposals. 

Consequently, this post tries to more clearly point at the distinction and explain why it might matter. I start by discussing the two terms in more detail, give a few examples of why you might have polysemanticity without superposition, and then conclude by explaining why distinguishing these concepts might matter.

 

A brief review of polysemanticity and superposition

Neuron polysemanticity

Neuron polysemanticity is the empirical phenomenon that neurons seem to correspond to multiple natural features.

The idea that neurons might be polysemantic predates modern deep learning – for example, the degree of polysemanticity of natural neurons was already being discussed in the neuroscience literature in the 1960s, though empirical work assessing the degree of polysemanticity seems to have started around the late 2000s. ML academics such as Geoffrey Hinton were discussing factors that might induce polysemanticity in artificial neural networks as early as the early 1980s.[4] That being said, most of these discussions remained conceptual or theoretical in nature.

From Mu and Andreas 2020. While some neurons in vision models seem to fire on related concepts (a and b), others fire on conjunctions or other simple boolean functions of concepts (c), and a substantial fraction are polysemantic and fire on seemingly unrelated concepts (d).

Modern discussion of neuron polysemanticity descends instead from empirical observations made by looking inside neural networks.[5] For example, Szegedy et al 2013 examine neurons in vision models and find that individual neurons do not seem to be more interpretable than random linear combinations of neurons. In all of Nguyen et al 2016, Mu and Andreas 2020, and the Distill.pub Circuits Thread, the authors note that some neurons in vision models seem to fire on unrelated concepts. Geva et al 2020 found that the neurons they looked at in a small language model fired on 3.6 identifiable patterns on average. Gurnee et al 2023 study the behavior of neurons using sparse linear probes on several models in the Pythia family, and find both that many neurons are polysemantic and that sparse combinations of these polysemantic neurons can allow concepts to be cleanly recovered. And as noted in the introduction, out of the 307k neurons in GPT-2, GPT-4 was able to generate short explanations that captured over 50% of the variance for only 5,203 neurons, a result that is echoed by other attempts to label neurons with simple concepts. I'd go so far as to say that every serious attempt at interpreting neurons in non-toy image or text models has found evidence of polysemantic neurons.
 

From Zoom In: An Introduction to Circuits. The authors find that a neuron representing a "car feature" seems to be connected to many polysemantic neurons, including neurons that are putative dog detectors.


The reason neuron polysemanticity is interesting is that it’s one of the factors preventing us from mechanistically interpreting neural networks by examining each neuron in turn (alongside the possibility of neurons being completely uninterpretable, with no meaningful patterns at all in their activations). If a neuron corresponds to a simple concept, we might hope to just read off the algorithm a network implements by looking at which features of the input each neuron corresponds to, and then attributing behavior to circuits composed of these interpretable neurons. But if neurons don’t have short descriptions (for example, if they are best described by listing many seemingly unrelated concepts), it becomes a lot harder to compose them into meaningful circuits. Instead, we need to first transform the activations of the neural network into a more interpretable form.[6] 
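To make this concrete, here is a minimal sketch of the basic workflow behind these observations: run some text through GPT-2, capture a chosen MLP neuron’s activations with a forward hook, and look at which tokens it fires on most strongly. This assumes the HuggingFace transformers implementation of GPT-2 (hooking mlp.c_fc captures pre-activation values; the GELU that follows is monotonic for large inputs, so the top-firing tokens are ranked the same), and the layer and neuron indices are arbitrary illustrative choices rather than known polysemantic neurons.

    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    layer, neuron = 5, 123          # arbitrary illustrative choices
    captured = []
    # Hook the first MLP projection to record the (pre-activation) neuron values.
    hook = model.transformer.h[layer].mlp.c_fc.register_forward_hook(
        lambda mod, inp, out: captured.append(out.detach())
    )

    texts = ["The dice came up snake eyes.", "She read the poem aloud twice."]
    for text in texts:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            model(**ids)
    hook.remove()

    # For each text, print the tokens on which this neuron's value is largest.
    for text, acts in zip(texts, captured):
        vals, idxs = acts[0, :, neuron].topk(k=3)
        tokens = tok.convert_ids_to_tokens(tok(text)["input_ids"])
        print(text, "->", [(tokens[i], round(v.item(), 2)) for v, i in zip(vals, idxs)])

In practice you would run this over a large corpus and cluster the top-activating contexts; the cited papers all do some more careful version of this.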

Superposition

Superposition occurs when a neural network linearly represents more features than its activations have dimensions.[7] The superposition hypothesis claims that polysemanticity occurs in neural networks because of this form of superposition.

There are two important subclaims that distinguish superposition from other explanations of polysemanticity. In the superposition hypothesis, features both 1) correspond to directions (and are approximately linearly represented) and 2) are more plentiful than activation dimensions.

The true environment could contain 10 million “features”,[8] but a model of width 10,000 might not care, and represent only 10 of them rather than placing a much larger number into superposition. In the case where the model represents only 10 features, you might hope to find a linear transformation that recovers these features from the model’s activations. In contrast, when the network represents 1 million of the 10 million features, no orthogonal transformation exists that recovers all the “features” encoded in the MLP activations (respectively, attention heads), because there are more features than the model has dimensions.

The standard reference for discussions of superposition is Anthropic’s Toy Model of Superposition. It mainly considers the toy case of storing disentangled sparse features in a lower-dimensional representation, such that the features can then be ReLU-linearly read off – that is, you can reconstruct the (positive magnitude of the) features via a linear map followed by a ReLU – and secondarily the case of using a single fully connected layer with fewer neurons than inputs to compute the absolute value of features. Recently, Vaintrob, Mendel, and Hänni 2024 proposed alternative models of computation in superposition based on small boolean operations that MLPs and attention heads can implement.
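To make that setup concrete, here is a minimal sketch (my own reimplementation, not Anthropic’s code) of the first toy case: sparse features are linearly compressed into fewer dimensions and then ReLU-linearly read off. The dimensions, sparsity level, and training details are arbitrary illustrative choices, and feature importances are taken to be uniform, unlike in the paper.

    import torch

    n_features, d_hidden, sparsity = 20, 5, 0.95   # illustrative values only

    W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_features))
    b = torch.nn.Parameter(torch.zeros(n_features))
    opt = torch.optim.Adam([W, b], lr=1e-2)

    for step in range(5000):
        # Sparse features: each is active (uniform in [0, 1]) with probability 1 - sparsity.
        x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) > sparsity)
        h = x @ W.T                      # compress into d_hidden dimensions
        x_hat = torch.relu(h @ W + b)    # ReLU-linear readoff
        loss = ((x - x_hat) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # Columns of W are the learned feature directions. With enough sparsity, more than
    # d_hidden of them end up with substantial norm, i.e. features are in superposition.
    print((W.norm(dim=0) > 0.5).sum().item(), "features represented in", d_hidden, "dimensions")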

 One of the non-intuitive properties of a high dimensional space is that the d-dimensional hypercube has exponentially many (in d) near-orthogonal corners, that are all very far from its center. This is the best representation of this fact I could find, though as far as I can tell this is not a standard projection of any type. Source: https://ssa.cf.ac.uk/pepelyshev/pepelyshev_cube.pdf.

There are two key factors that coincide to explain why superposition happens inside of neural networks. Different explanations or models emphasize one reason or the other, but all models I’m aware of assume some form of both:

  1. High-dimensional space contains many near-orthogonal directions. In general, there are exp(O(ε^2 d)) (that is, exponentially many in d) unit vectors in R^d with pairwise inner product at most ε.[9] This means that you can have many features whose directions have small inner products, and thus don’t interfere much with each other. In fact, these vectors are easy to construct – it generally suffices to take random corners of a centered d-dimensional hypercube (see the short numerical sketch after this list). 
  2. “Features” are sparse in that only a few are active at a time. Even if every other feature is represented by a direction that only interferes with any particular feature a tiny amount, if enough of them are active at once, then the total interference will still be large and thus incentivize the model to represent fewer features. However, if only a few are active, then the expected interference will be small, such that the benefit of representing more features outweighs the interference between them. In fact, you can get superposition in as few as two dimensions with enough sparsity (if you allow large negative interference to round off to zero)![10] As a result, models of superposition generally assume that features are sparse.[11]
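Here is a quick numerical check of the first factor, using the random-hypercube-corner construction mentioned above (the dimension and number of vectors are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 1000, 3000                     # three times more "features" than dimensions
    # Random corners of the centered d-dimensional hypercube, rescaled to unit norm.
    V = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)

    G = V @ V.T                           # pairwise inner products ("interference")
    off_diag = G[~np.eye(n, dtype=bool)]
    print("max |interference|: ", np.abs(off_diag).max())   # roughly 0.15 for d = 1000
    print("mean |interference|:", np.abs(off_diag).mean())  # roughly 0.025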
The headline diagram from Anthropic’s Toy Model of Superposition. As it says: as feature sparsity increases, the penalty for interference decreases, and so the low-rank autoencoder represents increasingly unimportant features by placing them in directions with small dot product. Also, as the diagram shows, this can happen in as few as two dimensions if features are sufficiently sparse.

As an aside, this is probably my favorite diagram of any 2022 paper, and probably top 3 of all time.

Polysemanticity without superposition

In order to better distinguish the two (and also justify the distinction), here are some alternative reasons why we might observe polysemanticity even in the absence of superposition.

Example 1: non–neuron aligned orthogonal features

There’s a perspective under which polysemanticity is completely unsurprising – after all, there are many possible ways to structure the computation of a neural network; why would each neuron correspond to a unique feature anyways?

Here, the main response is that neurons have element-wise activation functions, which impose a privileged basis on activation space. But despite this incentive to align representations with individual neurons, there are still reasons why individual neurons may not correspond to seemingly natural features, even if the MLP activations are linearly representing fewer features than neurons.

  1. Neural networks may implement low-loss but suboptimal solutions: Optimal networks may have particular properties that the networks Adam finds in reasonable time, starting from standard initializations, do not.[12] Even for problems where there's a single, clean, zero-loss global minimum, there are generally low-loss saddle points featuring messy algorithms that incidentally mix together unrelated features.[13] I think that this is one of the most likely alternatives to superposition for explaining actual neuron polysemanticity. 

    Feature sparsity exacerbates this problem: when features are sparse (as in models of superposition), interference between features is rare. So even when the minimum-loss solution has basis-aligned features, the incentive to align the features (i.e. to be correct on the rare cases where both features are present) acts at a much lower expected scale than the incentive to get the problem right in cases where only a single feature is present, and may be swamped by other considerations such as explicit regularization or gradient noise.[14] 
  2. Correlated features may be jointly represented: If features x and y in the environment have a strong correlation (either positive or negative), then the model may use a single neuron to represent a weighted combination of x and y instead of using two neurons to represent x and y independently.[15] It's often the case that datasets contain many correlations that are hard for humans to notice. For example, image datasets can contain camera or post-processing artifacts, and many seemingly unimportant features of pretraining datasets are helpful for next-token prediction. It seems possible to me that many seemingly unrelated concepts actually have meaningful correlations on even the true data distribution. 

    That being said, this does not explain the more obvious cases of polysemanticity that we seem to find in language models, where neurons activate strongly on features that are almost certainly completely unrelated.
  3. The computation may require mixing together “natural” features in each neuron: The main reason we think any particular neuron is polysemantic is that the features do not seem natural to interpretability researchers. But it might genuinely be the case that the optimal solution does not have neurons that are aligned with “natural” features!

    For example, the “natural” way to approximate x * y, for independent x and y, with a single-layer ReLU MLP is to learn piecewise linear approximations of (x+y)^2/4 and (x-y)^2/4, and then take their difference.[16] A naive interpretability researcher looking at such a network by examining how individual neurons behave on a handcrafted dataset (e.g. with contrast pairs where only one feature varies at a time) may wonder why neuron #5 fires on examples where either x or y is large.[17] No superposition happens in this case – the problem is exactly two dimensional – and yet the neurons appear polysemantic. (A minimal reimplementation sketch follows the figure below.)

    Arguably, this is a matter of perspective; one might argue that x+y and x-y are the natural features, and we’re being confused by thinking about what each neuron is doing by looking at salient properties of the input.[18] But I think this explains a nonzero fraction of seemingly polysemantic neurons, even if (as with the previous example) this reason does not explain the more clear cases of polysemanticity inside of language models.
The result of training a 4-neuron neural network to approximate x * y on unit normally distributed x, y. The learned weights approximately compute a piecewise-linear version of ((x+y)^2 - (x-y)^2)/4, which is close to x * y for typical data points. The results are qualitatively consistent across seeds, data distributions, and hyperparameters, though the exact approximation to f(x) = x^2 can vary greatly as a function of the data distribution.
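Here is a minimal reimplementation sketch of the experiment in the figure above (my own code, not the original; the hyperparameters are arbitrary). Inspecting the first-layer weights after training typically shows each neuron reading a mixture of x and y, roughly along the (1, 1) and (1, -1) directions, rather than x or y alone.

    import torch

    torch.manual_seed(0)
    net = torch.nn.Sequential(torch.nn.Linear(2, 4), torch.nn.ReLU(), torch.nn.Linear(4, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)

    for step in range(20_000):
        xy = torch.randn(256, 2)                          # x, y ~ N(0, 1), independent
        target = (xy[:, 0] * xy[:, 1]).unsqueeze(1)       # label is x * y
        loss = ((net(xy) - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    print("final loss:", loss.item())
    # Each row is one neuron's input weights; they tend to mix x and y rather than
    # being aligned with either input feature on its own.
    print("first-layer weights:\n", net[0].weight.data)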

The fact that there are fewer features than neurons makes this case meaningfully distinct from superposition, because there exists an orthogonal transformation mapping activations onto a basis made of interpretable features, while in the superposition case no such transformation can exist.

Example 2: non-linear feature representations

Neural networks are not linear functions, and so can make use of non-linear feature representations.

For example, one-layer transformers can use non-linear representations and thus cannot be understood as a linear combination of skip tri-gram features due to the softmax allowing for inhibition. In Anthropic’s January 2024 update, they find empirically that forcing activations of a one-layer transformer to be sparse can lead to the network using non-linear representations via a similar mechanism. Outside of one-layer transformers, Conjecture’s Polytope Lens paper finds that scaling the activations of a single layer of InceptionV1 causes semantic changes later in the network.[19] 

From Conjecture’s Polytope Lens paper. Contrary to some forms of the linear representation hypothesis, scaling the activations at an intermediate layer can change the label the model assigns to an image, for example from cougar to fire screen or Shetland sheepdog to dishrag. 

That being said, while I’d be surprised if neural networks used no non-linear feature representations, there’s a lot of evidence that neural networks represent a lot of information linearly. In casual conversation, I primarily point to the success of techniques such as linear probing or activation steering as evidence. It’s plausible that even if neural networks can and do use some non-linear feature representations, the vast majority of their behavior can be explained without referencing these non-linear features.
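As an illustration of what that probing evidence looks like, here is a toy sketch using synthetic stand-ins for activations (the dimension, the injected direction, and the signal strength are all made up): a concept written linearly along a single non-neuron-aligned direction is recovered well by a logistic-regression probe, even though no individual coordinate is monosemantic.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    d, n = 512, 4000
    concept = rng.integers(0, 2, size=n)        # binary concept label
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)      # a random, non-neuron-aligned direction

    # "Activations" = isotropic noise + the concept written along that one direction.
    acts = rng.normal(size=(n, d)) + 3.0 * concept[:, None] * direction

    probe = LogisticRegression(max_iter=1000).fit(acts[:3000], concept[:3000])
    print("probe accuracy:", probe.score(acts[3000:], concept[3000:]))   # well above chance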

Example 3: compositional representation without “features”

More pathologically, neural networks have weights that allow for basically hash-map-style memorization. For example, many of the constructions lower-bounding the representation power of neural networks of a given size involve “bit-extraction” constructions that use ReLUs to build interval functions which extract the appropriate bits of the binary representation of either the weights or the input.[20] These approaches involve no meaningful feature representations at all, except insofar as it’s reasonable to use the exponentially many unique combinations of bits as your features.

From Bartlett et al 2019. The authors provide a construction for extracting the most significant bits of a number, then repeatedly apply this to look up the arbitrary, memorized label for particular data points.

The specific constructions in the literature are probably far too pathological to be produced by neural network training (for example, they often require faithfully representing weights that are exponentially large or small in the dimension of the input, which any practical float representation simply lacks the precision to do). But it’s nonetheless conceptually possible that neural networks represent something that’s better described as a dense binary encoding of the input, rather than in terms of a reasonable number of linearly independent features, especially in cases where the network is trying to memorize large quantities of uniformly distributed data.[21]
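To give a flavor of these constructions (this is a simplified gadget, not the exact construction from Bartlett et al or Lin and Jegelka), two ReLUs with a steep slope approximate a step function that reads off the most significant bit of an input in [0, 1); rescaling the remainder and repeating extracts further bits, which is enough for hash-map-style lookups with no feature directions in any meaningful sense.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def msb(x, k=1e4):
        # Sharp ramp from 0 to 1 around x = 0.5: exact except on an interval of width 1/k.
        return relu(k * (x - 0.5)) - relu(k * (x - 0.5) - 1.0)

    xs = np.array([0.12, 0.49, 0.51, 0.875])
    bits = msb(xs)
    print(bits)                     # [0, 0, 1, 1]: whether each x is at least 0.5
    remainder = 2 * xs - bits       # shift out the extracted bit and repeat
    print(msb(remainder))           # second-most-significant bits: [0, 1, 0, 1]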

Conclusion: why does this distinction matter?

There’s a sense in which this post is basically making a semantic argument as opposed to a scientific one. After all, I don’t have a concrete demonstration of non-superposition polysemanticity in real models, or even a clear grasp of what such a demonstration may involve. I also think that superposition probably causes a substantial fraction of the polysemanticity we observe in practice.

But in general, it’s important to keep empirical observations distinct from hypotheses explaining those observations. As interpretability is a new, arguably pre-paradigmatic field, clearly distinguishing between empirically observed phenomena – such as polysemanticity – and leading hypotheses for why these phenomena occur – such as superposition – is even more important, both for interpretability researchers who want to do impactful non-applied research and for grantmakers looking to fund it.

In the specific case of polysemanticity and superposition, I think there are three main reasons:

Our current model of superposition may not fully explain neuron polysemanticity, so we should keep other hypotheses in mind

Polysemanticity may happen for other reasons, and we want to be able to notice that and talk about it.

For example, it seems conceptually possible that discussions of this kind are using the wrong notion of “feature".[22] As noted previously, the form of sparsity used in most models of superposition implies a non-trivial restriction on what “features” can be. (It’s possible that a more detailed non-trivial model of superposition featuring non-uniform sparsity would fix this.[23]) Assuming that polysemanticity and superposition are the same phenomenon means that it’s a lot harder to explore alternative definitions of “feature”.

Perhaps more importantly, our current models of superposition may be incorrect. The high-dimensional geometry claim that superposition depends on is that while there are only d directions in R^d with pairwise dot product exactly zero (and only 2d directions with pairwise dot product at most zero), there are exp(O(ε^2 d)) directions with pairwise dot product at most ε. Presumably real networks learn to ignore sufficiently small interference entirely. But in Anthropic's Toy Model of Superposition, the network is penalized for any positive interference, suggesting that the model may be importantly incorrect.[24] 

In fact, after I first drafted this post, recent work came out from Google Deepmind finding that SAEs that effectively round small positive interference to zero outperform vanilla SAEs that do not, suggesting that Anthropic’s Toy Model is probably incorrect in exactly this way, to a degree that matters in practice. See the figure below, though note that this is not the interpretation given by the authors.

By using SAEs with Jump-ReLU activations (Gated SAEs) that round small values to zero, we get a Pareto improvement over ReLU SAEs (the correct model for recovering features under Anthropic's TMS), suggesting that the Anthropic TMS model is incorrect. Results are for SAEs trained on Gemma-7B, and are from the recent Rajamanoharan et al paper from Google Deepmind: https://arxiv.org/abs/2404.16014.
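For concreteness, here is the activation-function difference at issue, sketched directly (the threshold is an arbitrary illustrative value): a plain ReLU passes small positive interference straight through, while a JumpReLU-style activation of the kind used in the Gated SAE work zeroes out anything below a threshold.

    import torch

    def jump_relu(x, theta=0.3):
        # Pass values above the threshold through unchanged; round everything else to zero.
        return torch.where(x > theta, x, torch.zeros_like(x))

    x = torch.tensor([-0.5, 0.05, 0.2, 0.4, 1.0])
    print(torch.relu(x))     # small positive "interference" (0.05, 0.2) survives
    print(jump_relu(x))      # only the values above the threshold (0.4, 1.0) survive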

Attempts to “solve superposition” may actually only be solving easier cases of polysemanticity

In conversations with researchers, I sometimes encounter claims that solving superposition merely requires importing some concept from neuroscience or earlier AI/ML research (especially from junior researchers, though there are examples of very senior researchers making the same error). But in almost every case, these ideas address polysemanticity that occurs for reasons other than superposition.

For example, disentangled representation learning is a large subfield of machine learning that aims to learn representations that disentangle features of the environment. But this line of work almost always assumes that there are a small number of important latent variables, generally far fewer than your model has dimensions (in fact, I’m not aware of any counterexamples, though I’ve only done a shallow dive into the relevant literature). As a result, many of these approaches will only apply to the easier case where superposition is false and the features are represented in a non–neuron-aligned basis, and not to the hard case where superposition occurs meaningfully in practice.

Clear definitions are important for clear communication and rigorous science

To be fully transparent, a lot of the motivation behind this post comes not from backchaining from particular concrete negative outcomes, but from a general sense of frustration about time wasted on conceptual confusions.

Poor communication caused in large part by confusing terminology is a real issue that’s wasted hours of my time as a grant-maker, and my own muddled thinking about concepts has wasted dozens if not hundreds of hours of my time as a researcher. It’s much harder to estimate the cost in good projects not funded or not even conceptualized due to similar confusions.

I’d like to have this happen less in the future, and my hope is that this post will help.

Acknowledgements

Thanks to Eleni Angelou, Erik Jenner, Jake Mendel, and Jason Gross for comments on this post, and to Dmitry Vaintrob for helpful discussions on models of superposition.

 

  1. ^

    I don't have a good definition of the word "feature", and I'm not aware of any in the literature. I think the lack of a clear idea of what "good" or "natural" features are lies behind a lot of conceptual problems and disagreements in mech interp. That being said, I don't think you can solve this problem by doing philosophy on it without running experiments or at least doing math. 

  2. ^

     For the sake of brevity, this post focuses primarily on superposition as an explanation for neuron polysemanticity, which is generally the type of polysemanticity that people study. Most of the claims straightforwardly apply to attention head superposition as well (both across attention heads and within a single attention head).

  3. ^

    After having read it in detail again for this post, I continue to think that Anthropic’s Toy Model of Superposition is an extremely well written piece, and would recommend at least reading the discussion and related work sections.

  4. ^

     See for example this 1981 paper from Hinton

  5. ^

     In fact, it’s pretty easy to replicate these results yourself without writing any code, by using e.g. OpenAI Microscope to look at neurons in vision models and Neel Nanda’s Neuroscope for neurons in text models.

  6. ^

    Polysemanticity is less of a concern for approaches that don’t aim to derive mechanistic understanding; e.g. linear probing does not care about polysemanticity as long as the features you probe for are linearly represented.

  7. ^

    Jason Gross notes in private communication that originally, superposition (as used in quantum mechanics) did mean “features are not aligned with our observational basis” (in other words, superposition ~= polysemanticity):

    Superposition in quantum mechanics refers to the phenomenon that quantum systems can be in states which are not basis-aligned to physical observables, instead being linear combinations thereof.  For example, a particle might be in a superposition of being "here" and being "there", or in a superposition of 1/sqrt(2)(spin up + spin down), despite the fact that you can never observe a particle to be in two places at once, nor to have two spins at once.

    So if we stick to the physics origin, superposed representations should just be linear combinations of features, and should not require more features than there are dimensions.  (And it's not that far a leap to get to nonlinear representations, which just don't occur in physics but would probably still be called "superposition" if they did, because "superposition" originally just meant "a state where the position [or any other observable] is something other than anything we could ever directly observe it to be in")

    But “superposition” in the context of interpretability does mean something different and more specific than this general notion of superposition.

  8. ^

    Again, I'm intentionally leaving what a "feature" is vague here, because this is genuinely a point of disagreement amongst model internals researchers that this section is far too short to discuss in detail.

  9. ^

    Traditionally, this is shown by applying Johnson-Lindenstrauss to the origin + the standard basis in R^m for m > d. You can also get this by a simple “surface area” argument – each unit vector “eliminates" a spherical cap on the unit hypersphere of exponentially decreasing measure in d (see this one-page paper for a concise mathematical argument), and so by greedily packing them you should get exponentially many near orthogonal unit vectors.

  10. ^

    Also, see any of the toy models in Anthropic’s TMS post, especially the 2 dimensional ones (like the one I included in this post). Note that while Anthropic’s TMS mentions both reasons, many of the toy models are very low-dimensional and mainly have superposition due to high sparsity + a slightly different definition of "linearly represented". 

  11. ^

     This is a nontrivial (albeit fairly justifiable, in my opinion) assumption: sparsity of this form rules out very compositional notions of feature; e.g. if books in your library can be fantasy, science fiction, or historical fiction, and each book can be in English, Hindi, French, or Chinese, the corresponding sparse features would be of the form “is this book in English” or “is the genre of this book fantasy”, and not “language” or “fantasy”. As this example shows, these models imply by construction that there are many, many features in the world. Contrast this with latent variable-based definitions of features such as Natural Abstractions or representation learning, which tend to assume a small number of important “features” that are sparsely connected but almost always present.

  12. ^

     This is especially true if you didn’t initialize your network correctly, or used bad hyperparameters. Also, see the Shard Theory for a concrete example of how “optimized for X” may not imply “optimal for X”.

  13. ^

    I've checked this empirically in a few toy cases, such as approximating the element-wise absolute value of sparse real-valued vectors or the pairwise AND of boolean values, where the optimal solution given the number of neurons in the network is a basis-aligned zero-loss solution. With many settings of hyperparameters (the main important ones seem to be weight decay and sparsity), these one-layer MLPs consistently get stuck in low but non-zero loss solutions, even with large batch sizes and with different first-order optimizers. 

    Due to the small size of the networks, this specific result may be due to optimization difficulties. But even when the MLPs are made significantly wide, such that they consistently converge to zero loss, the networks do not seem to learn monosemantic solutions (and instead prefer zero-loss polysemantic ones). So I think there's some part of the explanation of this experimental result that probably involves learning dynamics and the inductive biases of (S)GD/Adam and how we initialize neural networks. 

  14. ^

     After drafting this post, I found this ICLR 2024 workshop paper that makes this case more strongly using a more thorough analysis of a slightly modified version of Anthropic’s Toy Model of Superposition, though I haven’t had time to read it in detail:

    […] We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data, a phenomenon we term “incidental polysemanticity”. Using a combination of theory and experiments, we show that incidental polysemanticity can arise due to multiple reasons including regularization and neural noise; this incidental polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap.

  15. ^

     See e.g. this toy example in Anthropic’s Toy Model of Superposition. In this setting, the environment has 6 natural features by construction that occur in three correlated pairs. When features are sufficiently dense and correlated, the model learns to represent only 3 features, compared to the 4 signed directions in 2 dimensions. That being said, this example is more analogous to the residual stream than to MLP activations, and more importantly uses a different notion of "feature" than "linearly represented dimension". 

    I generally refer to these features as ReLU-linear, not linear; that is, you recover them by a linear transformation into a higher dimensional space, followed by a ReLU. This different definition of "feature" means the example is not perfectly analogous to the argument presented previously. (For example, interference between ReLU-linear features is zero as long as the inner product between features is at most zero, instead of being exactly equal to zero, so there are 2d such features in R^d: take any orthogonal basis together with its negation.) 

  16. ^

     This uses the fact that x * y = ((x+y)^2 - (x-y)^2) / 4.

  17. ^

     This example isn’t completely artificial – it seems that this is part of the story for how the network in the modular addition work multiplies together the Fourier components of its inputs. Note that in this case, there is a low-dimensional natural interpretation that makes sense at a higher level (that the MLP layer linearly represents terms of the form cos(a)cos(b) and so forth), but looking at neurons one at a time is unlikely to be illuminating.

  18. ^

     I’m somewhat sympathetic to this perspective. For a more biting critique of interpretability along these lines, see Lucius Bushnaq’s “fat giraffe” example, though also see the discussion below that comment for counterarguments.

  19. ^

    That being said, note that Inception v1 uses Batch norm with fixed values during inference, so the results may not apply to modern transformers with pre-RMS norm.

  20. ^

     E.g. see the classic Bartlett et al 2019 for an example of extracting bits from the weights, and Lin and Jegelka 2018 which shows that 1 neuron residual networks are universal by extracting the appropriate bits from the input.

  21. ^

    Jake Mendel pointed out to me on a draft of the post that even if neural networks ended up learning this, none of our techniques would be able to cleanly distinguish this from other causes of polysemanticity. 

  22. ^

    Again, consider Lucius Bushnaq’s “fat giraffe” example as an argument that people are using the wrong notion of feature.

  23. ^

    For example, I think that you can keep most of the results in both Anthropic’s TMS and Vaintrob, Mendel, and Hänni 2024 if the feature sparsity is exponentially or power-law distributed with appropriate parameters, as opposed to uniform.

  24. ^

    There's another objection here, which goes along the lines of "why is large negative interference treated differently than positive interference?". Specifically, I'm not aware of any ironclad argument for why we should care about directions with pairwise dot product of magnitude at most ε, as opposed to directions with pairwise dot product at most ε. H/t Jake Mendel for making this point to me a while ago. 

    See Vaintrob, Mendel, and Hänni 2024 (of which he is a coauthor) for a toy model that does “round off” both small negative and small positive interference to zero. That being said, that model is probably also importantly incorrect, because the functions implemented by neural networks are almost certainly disanalogous to simple boolean circuits.




Anthropic release Claude 3, claims >GPT-4 Performance

Published on March 4, 2024 6:23 PM GMT

Today, we're announcing the Claude 3 model family, which sets new industry benchmarks across a wide range of cognitive tasks. The family includes three state-of-the-art models in ascending order of capability: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus. Each successive model offers increasingly powerful performance, allowing users to select the optimal balance of intelligence, speed, and cost for their specific application.

 

Better performance than GPT-4 on many benchmarks

The largest Claude 3 model seems to outperform GPT-4 on benchmarks (though note slight differences in evaluation methods):

Opus, our most intelligent model, outperforms its peers on most of the common evaluation benchmarks for AI systems, including undergraduate level expert knowledge (MMLU), graduate level expert reasoning (GPQA), basic mathematics (GSM8K), and more. It exhibits near-human levels of comprehension and fluency on complex tasks, leading the frontier of general intelligence.

Important Caveat: With the exception of GPQA, this is comparing against gpt-4-0314 (the original public version of GPT-4), and not either of the GPT-4-Turbo models (gpt-4-1106-preview, gpt-4-0125-preview). The GPT-4 entry for GPQA is gpt-4-0613, which performs significantly better than -0314 on benchmarks. Where the data exists, gpt-4-1106-preview consistently outperforms Claude 3 Opus. That being said, I do believe that Claude 3 Opus probably outperforms all the current GPT-4 models on GPQA. Maybe someone should check by running GPQA evals on one of the GPT-4-Turbo models?

Also, while I haven't yet had the chance to interact much with this model, as of writing Manifold assigns ~70% probability to Claude 3 outperforming GPT-4 on the LMSYS Chatbot Arena Leaderboard.

https://manifold.markets/JonasVollmer/will-claude-3-outrank-gpt4-on-the-l?r=Sm9uYXNWb2xsbWVy

Synthetic data?

According to Anthropic, Claude 3 was trained on synthetic data (though it was not trained on any customer-generated data from previous models):

It's also interesting that the model can identify the synthetic nature of some of its evaluation tasks. For example, it provides the following response to a synthetic recall test:

Is Anthropic pushing the frontier of AI development?

Several people have pointed out that this post seems to take a different stance on race dynamics than was expressed previously:

As we push the boundaries of AI capabilities, we’re equally committed to ensuring that our safety guardrails keep apace with these leaps in performance. Our hypothesis is that being at the frontier of AI development is the most effective way to steer its trajectory towards positive societal outcomes.

EDIT: Lukas Finnveden pointed out that they included a footnote in the blog post caveating their numbers:

  1. This table shows comparisons to models currently available commercially that have released evals. Our model card shows comparisons to models that have been announced but not yet released, such as Gemini 1.5 Pro. In addition, we’d like to note that engineers have worked to optimize prompts and few-shot samples for evaluations and reported higher scores for a newer GPT-4T model. Source.

And indeed, from the linked Github repo, gpt-4-1106-preview still seems to outperform Claude 3:

Ignoring the MMLU results, which use a fancy prompting strategy that Anthropic presumably did not use for their evals, Claude 3 gets 95.0% on GSM8K, 60.1% on MATH, 84.9% on HumanEval, 86.8% on Big Bench Hard, 93.1 F1 on DROP, and 95.4% on HellaSwag. So Claude 3 is arguably not pushing the frontier on LLM development.

 

EDIT2: I've compiled the benchmark numbers for all models with known versions:

On every benchmark where both were evaluated, gpt-4-1106 outperforms Claude 3 Opus. However, given the size of the performance gap on GPQA, it seems plausible to me that Claude 3 substantially outperforms all GPT-4 versions there, even though the later GPT-4s (post -0613) have not been evaluated on GPQA. 

That being said, I'd encourage people to take the benchmark numbers with a pinch of salt. 




Sam Altman fired from OpenAI

Published on November 17, 2023 8:42 PM GMT

Basically just the title, see the OAI blog post for more details.

Mr. Altman’s departure follows a deliberative review process by the board, which concluded that he was not consistently candid in his communications with the board, hindering its ability to exercise its responsibilities. The board no longer has confidence in his ability to continue leading OpenAI.

In a statement, the board of directors said: “OpenAI was deliberately structured to advance our mission: to ensure that artificial general intelligence benefits all humanity. The board remains fully committed to serving this mission. We are grateful for Sam’s many contributions to the founding and growth of OpenAI. At the same time, we believe new leadership is necessary as we move forward. As the leader of the company’s research, product, and safety functions, Mira is exceptionally qualified to step into the role of interim CEO. We have the utmost confidence in her ability to lead OpenAI during this transition period.”


EDIT:

Also, Greg Brockman is stepping down from his board seat:

As a part of this transition, Greg Brockman will be stepping down as chairman of the board and will remain in his role at the company, reporting to the CEO.

The remaining board members are:

OpenAI chief scientist Ilya Sutskever, independent directors Quora CEO Adam D’Angelo, technology entrepreneur Tasha McCauley, and Georgetown Center for Security and Emerging Technology’s Helen Toner.


EDIT 2:

Sam Altman tweeted the following.

i loved my time at openai. it was transformative for me personally, and hopefully the world a little bit. most of all i loved working with such talented people. 

will have more to say about what’s next later. 

🫡 

Greg Brockman has also resigned.




Open Phil releases RFPs on LLM Benchmarks and Forecasting

Published on November 11, 2023 3:01 AM GMT

As linked at the top of Ajeya's "do our RFPs accelerate LLM capabilities" post, Open Philanthropy (OP) recently released two requests for proposals (RFPs):

  1. An RFP on LLM agent benchmarks: how do we accurately measure the real-world, impactful capabilities of LLM agents?
  2. An RFP on forecasting the real-world impacts of LLMs: how can we understand and predict the broader real-world impacts of LLMs?

Note that the first RFP is both significantly more detailed and has narrower scope than the second one, and OP recommends you apply for the LLM benchmark RFP if your project may be a fit for both. 

Brief details for each RFP below, though please read the RFPs for yourself if you plan to apply. 

Benchmarking LLM agents on consequential real-world tasks

Link to RFP: https://www.openphilanthropy.org/rfp-llm-benchmarks

We want to fund benchmarks that allow researchers starting from very different places to come to much greater agreement about whether extreme capabilities and risks are plausible in the near-term. If LLM agents score highly on these benchmarks, a skeptical expert should hopefully become much more open to the possibility that they could soon automate large swathes of important professions and/or pose catastrophic risks. And conversely, if they score poorly, an expert who is highly concerned about imminent catastrophic risk should hopefully reduce their level of concern for the time being.

In particular, they're looking for benchmarks with the following three desiderata:

  • Construct validity: the benchmark accurately captures a potential real-world, impactful capability of LLM agents.
  • Consequential tasks: the benchmark features tasks that will have massive economic impact or can pose massive risks. 
  • Continuous scale: the benchmark improves relatively smoothly as LLM agents improve (that is, they don't go from ~0% performance to >90% like many existing LLM benchmarks have).

Also, OP will do a virtual Q&A session for this RFP:

We will also be hosting a 90-minute webinar to answer questions about this RFP on Wednesday, November 29 at 10 AM Pacific / 1 PM Eastern (link to come).

Studying and forecasting the real-world impacts of systems built from LLMs

Link to RFP: https://www.openphilanthropy.org/rfp-llm-impacts/

This RFP is significantly less detailed, and primarily consists of a list of projects that OP may be willing to fund:

To this end, in addition to our request for proposals to create benchmarks for LLM agents, we are also seeking proposals for a wide variety of research projects which might shed light on what real-world impacts LLM systems could have over the next few years

Here's the full list of projects they think could make a strong proposal:

  • Conducting randomized controlled trials to measure the extent to which access to LLM products can increase human productivity on real-world tasks. For example: 
  • Polling members of the public about whether and how much they use LLM products, what tasks they use them for, and how useful they find them to be.
  • In-depth interviews with people working on deploying LLM agents in the real world.
  • Collecting “in the wild” case studies of LLM use, for example by scraping Reddit (e.g. r/chatGPT), asking people to submit case studies to a dedicated database, or even partnering with a company to systematically collect examples from consenting customers.
  • Estimating and collecting key numbers into one convenient place to support analysis.
  • Creating interactive experiences that allow people to directly make and test their guesses about what LLMs can do.
  • Eliciting expert forecasts about what LLM systems are likely to be able to do in the near future and what risks they might pose.
  • Synthesizing, summarizing, and analyzing the various existing lines of evidence about what language model systems can and can’t do at present (including benchmark evaluations, deployed commercial uses, and qualitative case studies, etc) and what they might be able to do soon to arrive at an overall judgment about what LLM systems are likely to be able to do in the near term.

There's no Q&A session for this RFP.




What I would do if I wasn’t at ARC Evals

Published on September 5, 2023 7:19 PM GMT

In which: I list 9 projects that I would work on if I wasn’t busy working on safety standards at ARC Evals, and explain why they might be good to work on. 

Epistemic status: I’m prioritizing getting this out fast as opposed to writing it carefully. I’ve thought for at least a few hours and talked to a few people I trust about each of the following projects, but I haven’t done that much digging into each of these, and it’s likely that I’m wrong about many material facts. I also make little claim to the novelty of the projects. I’d recommend looking into these yourself before committing to doing them. (Total time spent writing or editing this post: ~8 hours.)

Standard disclaimer: I’m writing this in my own capacity. The views expressed are my own, and should not be taken to represent the views of ARC/FAR/LTFF/Lightspeed or any other org or program I’m involved with. 

Thanks to Ajeya Cotra, Caleb Parikh, Chris Painter, Daniel Filan, Rachel Freedman, Rohin Shah, Thomas Kwa, and others for comments and feedback. 

Introduction

I’m currently working as a researcher on the Alignment Research Center Evaluations Team (ARC Evals), where I’m working on lab safety standards. I’m reasonably sure that this is one of the most useful things I could be doing with my life. 

Unfortunately, there’s a lot of problems to solve in the world, and lots of balls that are being dropped, that I don’t have time to get to thanks to my day job. Here’s an unsorted and incomplete list of projects that I would consider doing if I wasn’t at ARC Evals:

  1. Ambitious mechanistic interpretability.
  2. Getting people to write papers/writing papers myself. 
  3. Creating concrete projects and research agendas. 
  4. Working on OP’s funding bottleneck. 
  5. Working on everyone else’s funding bottleneck. 
  6. Running the Long-Term Future Fund. 
  7. Onboarding senior(-ish) academics and research engineers.
  8. Extending the young-EA mentorship pipeline. 
  9. Writing blog posts/giving takes. 

I’ve categorized these projects into three broad categories and will discuss each in turn below. For each project, I’ll also list who I think should work on them, as well as some of my key uncertainties. Note that this document isn’t really written for myself to decide between projects, but instead as a list of some promising projects for someone with a similar skillset to me. As such, there’s not much discussion of personal fit. 

If you’re interested in working on any of the projects, please reach out or post in the comments below! 

Relevant beliefs I have

Before jumping into the projects I think people should work on, I think it’s worth outlining some of my core beliefs that inform my thinking and project selection:

  1. Importance of A(G)I safety: I think A(G)I Safety is one of the most important problems to work on, and all the projects below are thus aimed at AI Safety. 
  2. Value beyond technical research: Technical AI Safety (AIS) research is crucial, but other types of work are valuable as well. Efforts aimed at improving AI governance, grantmaking, and community building are important and we should give more credit to those doing good work in those areas. 
  3. High discount rate for current EA/AIS funding: There’s several reasons for this: first, EA/AIS Funders are currently in a unique position due to a surge in AI Safety interest without a proportional increase in funding. I expect this dynamic to change and our influence to wane as additional funding and governments enter this space.[1] Second, efforts today are important for paving the path to future efforts in the future. Third, my timelines are relatively short, which increases the importance of current funding.
  4. Building a robust EA/AIS ecosystem: The EA/AIS ecosystem should be more prepared for unpredictable shifts (such as the FTX implosion last year). I think it’s important to robustify parts of the ecosystem, for example by seeding new organizations, building more legible credentials, doing more broad (as opposed to targeted) outreach, and creating new, independent grantmakers. 
  5. The importance of career stability and security: A lack of career stability hinders the ability and willingness of people (especially junior researchers) to prioritize impactful work over risk-averse, safer options. Similarly, cliffs in the recruitment pipeline due to a lack of funding or mentorship discourage pursuing ambitious new research directions over joining an existing lab. Personally, I’ve often worried about my future job prospects and position inside the community, when considering what career options to pursue, and I’m pretty sure these considerations weigh much more heavily on more junior community members.[2] 

Technical AI Safety Research

My guess is this is the most likely path I’ll take if I were to leave ARC Evals. I enjoy technical research and have had a decent amount of success doing it in the last year and a half. I also still think it’s one of the best things you can do if you have strong takes on what research is important and the requisite technical skills. 

Caveat: Note that if I were to do technical AI safety research again, I would probably spend at least two weeks figuring out what research I thought was most worth doing,[3] so this list is necessarily very incomplete. There’s also a decent chance I would choose to do technical research at one of OpenAI, Anthropic, or Google Deepmind, where my research projects would also be affected by management and team priorities. 

Ambitious mechanistic interpretability

One of the hopes with mechanistic (bottom-up) interpretability is that it might succeed ambitiously: that is, we’re able to start from low-level components and build up to an understanding of most of what the most capable models are doing. Ambitious mechanistic interpretability would clearly be very helpful for many parts of the AIS problem,[4] and I think that there’s a decent chance that we might achieve it. I would try to work on some of the obvious blockers for achieving this goal. 

Here’s some of the possible broad research directions I might explore in this area:

  • Defining a language for explanations and interpretations. Existing explanations are specified and evaluated in pretty ad-hoc ways. We should try to come up with a language that actually captures what we want here. Both Geiger and Wu’s causal abstractions and our Causal Scrubbing paper have answers to this, but both are unsatisfactory for several reasons.
  • Metrics for measuring the quality of explanations. How do we judge how good an explanation is? So far, most of the metrics focus on the extensional equality (that is, how well the circuit matches their input-output behavior), but there are many desiderata besides that. Does percent loss recovered (or other input-output only criteria) suffice for recovering good explanations? If not, can we construct examples where it fails?
  • Finding the correct units of analysis for neural networks. It’s not clear what the correct low-level units of analysis are inside of neural networks. For example, should we try to understand individual neurons, clusters of neurons, or linear combinations of neurons? It seems pretty important to figure this out in order to e.g automate mechanistic interpretability. 
  • Pushing the Pareto frontier on quality <> realism of explanations. A lot of manual mechanistic interpretability work focuses primarily on scaling explanations to larger models, as opposed to more complex tasks or more comprehensive explanations, which I think are more important. In order for ambitious mechanistic interpretability to work out, we need to understand the behavior of the networks to a really high degree, instead of e.g. the ~50% loss recovered we see when performing Causal Scrubbing on the circuit from the Indirect Object Identification paper. At the same time, existing mech interp work continues to primarily focus on simple algorithmic tasks, which seems like it misses out on most of the interesting behavior of the neural networks. 

How you can work on it: Write up a research agenda and do a project with a few collaborators, and then start scaling up from there. Also, consider applying for the OpenAI or Anthropic interpretability teams. 

Core uncertainties: Is the goal of ambitious mechanistic interpretability even possible? Are there other approaches to interpretability or model psychology that are more promising? 

Late stage project management and paper writing

I think that a lot of good AIS work gets lost or forgotten due to a lack of clear communication.[5]  Empirically, I think a lot of the value I provided in the last year and a half has been by helping projects get out the door and into a proper paper-shaped form. I’ve done this to various extents for the modular arithmetic grokking paper, the follow-up work on universality, the causal scrubbing posts, the ARC Evals report, etc. (This is also a lot of what I’m doing at ARC Evals nowadays.) 

I’m not sure exactly how valuable this is relative to just doing more technical research, but it does seem like there are many, many ideas in the community that would benefit from a clean writeup. While I do go around telling people that they should write up more things, I think I could also just be the person writing these things up. 

How you can work on it: find an interesting mid-stage project with promising preliminary results and turn it into a well-written paper. This probably requires some amount of prior paper-writing experience, e.g. from academia.

Core uncertainties: How likely is this problem to resolve itself, as the community matures and researchers get more practice with write-ups? How much value is there in actually doing the writing, and does it have to funge against technical AIS research? 

Creating concrete projects and research agendas

Both concrete projects and research agendas are very helpful for onboarding new researchers (both junior and senior) and for helping to fund more relevant research from academia. I claim that one of the key reasons mechanistic interpretability has become so popular is an abundance of concrete project ideas and intro material from Neel Nanda, Callum McDougal, and others. Unfortunately, the same cannot really be said for many other subfields; there isn’t really a list of concrete project ideas for say, capability evals or deceptive alignment research. 

I’d probably start by doing this for either empirical ELK/generalization research or high-stakes reliability/relaxed adversarial training research, while also doing research in the area in question. 

I will caveat that I think many newcomers write these research agendas with insufficient familiarity with the subject matter. I’m reluctant to encourage more people without substantial research experience to try this; my guess is the minimal experience is somewhere around one conference paper–level project and an academic review paper of a related area. 

How you can work on it: Write a list of concrete projects or research agenda in a subarea of AI safety you’re familiar with. As discussed before, I wouldn’t recommend attempting this without significant amounts of familiarity with the area in question. 

Core uncertainties: Which research agendas are actually good and worth onboarding new people onto? How much can you actually contribute to creating new projects or writing research agendas in a particular area without being one of the best researchers in that area?

Grantmaking

I think there are significant bottlenecks in the EA-based AI Safety (AIS) funding ecosystem, and they could be addressed with a significant but not impossible amount of effort. Currently, the Open Philanthropy project (OP) gives out ~$100-150m/year to longtermist causes (maybe around $50m to technical safety?),[6] and this seems pretty small given its endowment of maybe ~$10b. On the other hand, there just isn’t much OP-independent funding here; SFF maybe gives out ~$20m/year,[7] LTFF gives out $5-10m a year (and is currently having a bit of a funding crunch), and Manifund is quite new (though it still has ~$1.9M according to its website).[8]

Caveat: I’m not sure who exactly should work in this area. It seems overdetermined to me that we should have more technical people involved, but a lot of the important things to do to improve grantmaking are not technical work and do not necessitate technical expertise.

Working on Open Philanthropy’s Funding Bottlenecks

(Note that I do not have an offer from OP to work with them; this is more something that I think is important and worth doing as opposed to something I can definitely do.)

I think that OP is giving way less money to AI Safety than it should be under reasonable assumptions. For example, funding for AI Safety probably comes with a significant discount rate: it’s widely believed that we’ll see an influx of funding from new philanthropists or from governments, and it also seems plausible that our influence will wane as governments get involved. 

My impression is that this is mainly due to grantmaker capacity constraints; for example, Ajeya Cotra is currently the only evaluator for technical AIS grants. This can be alleviated in several ways:

  • Most importantly, working at OP on one of the teams that does AIS grantmaking. 
  • Helping OP design and run more scalable grantmaking programs that don’t significantly compromise on quality. This probably requires working with them for a few months; just creating the RFP doesn’t really address the core bottleneck.
  • Creating good scalable alignment projects that can reliably absorb lots of funding. 

How you can work on it: Apply to work for Open Phil. Write RFPs for Open Phil and help evaluate proposals. More ambitiously, create a scalable, low-downside alignment project that could reliably absorb significant amounts of funding. 

Core uncertainties: To what extent is OP actually capacity constrained, as opposed to pursuing a strategy that favors saving funding for the future? How much of OP’s decision comes down to different beliefs about e.g. takeoff speeds? How good is broader vs more targeted, careful grantmaking? 

Working on the other EA funders’ funding bottlenecks

Unlike OP, which is primarily capacity constrained, the remainder of the EA funders are funding constrained. For example, LTFF currently has a serious funding crunch. In addition, it seems pretty bad for the health of the ecosystem if OP funds the vast majority of all AIS research. It would be significantly healthier if there were counterbalancing sources of funding. 

Here are some ways to address this problem: First and foremost, if you have very high earning potential, you could earn to give. Second, you could try to convince an adjacent funder to significantly increase their contributions to the AIS ecosystem; for example, Schmidt Futures has historically given significant amounts of money to AI Safety/safety-adjacent academics, and it seems plausible that working on their capacity constraints could allow them to give more to AIS in general. Finally, you could fundraise for LTFF or Manifund, or start your own fund and fundraise for that.

How you can work on it: Convince an adjacent grantmaker to move into AIS. Fundraise for AIS work for an existing grantmaker or create and fundraise for a new fund. Donate a lot of money yourself. 

Core uncertainties: How tractable is this, relative to alleviating OP’s capacity bottleneck? How likely is this to be fixed by default, as we get more AIS interest? How much total philanthropic funding would be actually interested in AIS projects? How valuable is a grantmaker who potentially doesn’t share many of the core beliefs of the AIS ecosystem?

Chairing the Long-Term Future Fund

(Note that while I am an LTFF guest fund manager and have spoken with fund managers about this role, I do not have an offer from LTFF to chair the fund; as with the OP section, this is more something that I think is important and worth doing as opposed to something I can definitely do.)

As part of the move to separate the Long-Term Future Fund from Open Philanthropy, Asya Bergal plans to step down as LTFF Chair in October. This means that the LTFF will be left without a chair. 

I think the LTFF fills an important role in the ecosystem, and it’s important for it to be run well. This is both because of its independence from OP and because it’s the primary source of small grants for independent researchers. My best guess is that a well-run LTFF could move as much as $10m a year. On the other hand, if the LTFF fails, then I think this would be very bad for the ecosystem. 

That being said, this seems like a pretty challenging position: not only is the LTFF currently very funding constrained (and with uncertain future funding prospects), but its position within Effective Ventures may also limit ambitious activities in the future. 

How you can work on it: Fill in this Google form to express your interest.

Core uncertainties: Is it possible to raise significant amounts of funding for LTFF in the long run, and if so, how? How should the LTFF actually be run? 

Community Building

I think that the community has done an incredible job of field building amongst university students and other junior/early-career people. Unfortunately, there’s a comparative lack of senior researchers in the field, causing a massive shortage of both research team leads and mentorship. I also think that recruiting senior researchers and REs to do AIS work is valuable in itself. 

Onboarding senior academics and research engineers

The clearest way to get more senior academics or REs is to directly try to recruit them. It’s possible the best way for me to work on this is to go back to being a PhD student, and try to organize workshops or other field building projects. Here are some other things that might plausibly be good:

  • Connecting senior academics and REs with professors or other senior REs, who can help answer more questions and will likely be more persuasive than junior people without many legible credentials. Note that I don’t recommend doing this unless you have academic credentials and are relatively senior. 
  • Creating research agendas with concrete projects and proving their academic viability by publishing early stage work in those research agendas, which would significantly help with recruiting academics. 
  • Creating concrete research projects with heavy engineering slants and clear explanations of why these projects are alignment relevant; the lack of such projects seems to be a significant bottleneck for recruiting engineers. 
  • Normal networking/hanging out/talking stuff.
  • Being a PhD student and influencing your professor/lab mates. My guess is the highest impact here is to do a PhD at a location with a small number of AIS-interested researchers, as opposed to going to a university without any AIS presence.

Note that senior researcher field building has gotten more interest in recent times; for example, CAIS has run a fellowship for senior philosophy PhD students and professors and Constellation has run a series of workshops for AI researchers. That being said, I think there’s still significant room for more technical people to contribute here. 

How you can work on it: Be a technical AIS researcher with interest in field building, and do any of the projects listed above. Also consider becoming a PhD student.

Core uncertainties: How good is it to recruit more senior academics relative to recruiting many more junior people? How good is research or mentorship if it’s not targeted directly at the problems I think are most important?

Extending the young EA/AI researcher mentorship pipeline

I think the young EA/AI researcher pipeline does a great job getting people excited about the problem and bringing them in contact with the community, a fairly decent job helping them upskill (mainly due to MLAB variants, ARENA, and Neel Nanda/Callum McDougal’s mech interp materials), and a mediocre job of helping them get initial research opportunities (e.g. SERI MATS, the ERA Fellowship, SPAR). However, I think the conversion rate from that level into actual full-time jobs doing AIS research is quite poor.[9]

I think this is primarily due to a lack of research mentorship for junior researchers and/or a lack of research management capacity at orgs, exacerbated by a lack of concrete projects for younger researchers to work on independently. The other issue is that many junior people overly fixate on explicit AIS-branded programs. Historically, all the AIS researchers who’ve been around for more than a few years got there without going through much of (or even any of) the current AIS pipeline. (See also the discussion in Evaluations of new AI safety researchers can be noisy.)

Many of the solutions here look very similar to ways to onboard senior academics and research engineers, but there are a few other ones:

  • Encourage and help promising researchers pursue PhDs.
  • Create and fund more internship programs in academia, to use pre-existing research mentorship capacity.
  • Run more internship or fellowship programs that lead directly to full-time jobs, in collaboration with (or just from) AIS orgs.
  • Come up with a promising AIS research agenda, and then work at an org and recruit junior researchers.

In addition, you could mentor more people yourself if you're currently working as a senior researcher!

How you can work on it: Onboard more senior people into AIS. Encourage more senior researchers to mentor more new researchers. Create programs that make use of existing mentorship capacity, or that lead more directly to full-time jobs at AIS orgs. 

Core uncertainties: How valuable are more junior researchers compared to more senior ones? How long does it take for a junior researcher to reach certain levels of productivity? How bad are the bottlenecks, really, from the perspective of orgs? (E.g. it doesn’t seem implausible to me that the most capable and motivated young researchers are doing fine.)

Writing blog posts or takes in general

Finally, I do enjoy writing a lot, and I would like to have the time to write a lot of my ideas (or even other people’s ideas) into blog posts. 

Admittedly, this is primarily personal satisfaction–motivated and less impact-driven, but I do think that writing things (and then talking to people about them) is a good way to make things happen in this community. I imagine that the primary audience of these writeups will be other alignment researchers, and not the general LessWrong audience. 

Here’s an incomplete list of blog posts I started in the last year that I unfortunately didn’t have the time to finish:

  • Ryan Greenblatt’s takes on why we should do ambitious mech interp (and avoid narrow or limited mech interp), which I broadly agree with. 
  • Why most techniques for AI control or alignment would fail if a very powerful unaligned AI (an ‘alien jupiter brain’) manifested inside your datacenter, and why that might be okay anyways.
  • Why a lot of methods of optimizing or finetuning pretrained models (RLHF, BoN, quantilization, DPO, etc) are basically equivalent modulo (in theory) optimization difficulties or priors, and why people’s intuitions on differences between them likely come down to imagining different amounts of optimization power applied by different algorithms. (And my best guess as to the reasons for why they are significantly different in practice.)
  • The case for related work sections. 
  • The (very) important jobs besides technical AI research, and how we as a community could do a better job of not discouraging people from taking them. 
  • Why the community should spend 50% less time talking about explicit status considerations. 

There’s some chance I’ll try to write more blog posts in my spare time, but this depends on how busy I am otherwise.

How you can work on it: Figure out areas where people are confused, come up with takes that would make them less confused or find people with good takes in those areas, and write them up into clear blog posts. 

Core uncertainties: How much impact do blog posts and writing have in general, and how impactful has my work been in particular? Who is the intended audience for these posts, and will they actually read them?

  1. ^

    Anecdotally, it’s been decently easy for AIS orgs such as ARC Evals and FAR AI to raise money from independent, non-OP/SFF/LTFF sources this year. 

  2. ^

    Aside from the impact-based arguments, I also think it’s pretty bad from a deontological standpoint to convince many people to drop out or make massive career changes with explicit or implicit promises of funding and support, and then pull the rug from under them.

  3. ^

    In fact, it seems very likely that I’ll do this anyway, just for the value of information. 

  4. ^

    For example, a high degree of understanding would provide ways to detect deceptive alignment, elicit latent knowledge, or provide better oversight; a very high degree of understanding may even allow us to do microscope AI or well-founded AI. 


    This is not a novel view; it’s also discussed under different names in other blog posts such as 'Fundamental' vs 'applied' mechanistic interpretability research, A Longlist of Theories of Impact for Interpretability, and Interpretability Dreams.

  5. ^

    As the worst instance of this, the best way to understand a lot of AIS research in 2022 was “hang out at lunch in Constellation”. 

  6. ^

    The grants database lists ~$68m worth of public grants given out in 2023 for Longtermism/AI x-risk/Community Building (Longtermism), of which ~$28m was given to AI x-risk and ~$32m was given to community building. However, OP gives out significant amounts of money via grants that aren’t public. 

  7. ^

    This is tricky to estimate, since the SFF gave out significantly more money in the first half of 2023 (~$21m) than it did in all of 2022 (~$13m). 

  8. ^

    CEA also gives out a single digit million worth of funding every year, mainly to student groups and EAGx events. 

  9. ^

    This seems quite unlikely to be my comparative advantage, and it’s not clear it’s worth doing at all – for example, many of the impressive young researchers in past generations have made it through without even the equivalent of SERI MATS.




Meta announces Llama 2; "open sources" it for commercial use

Published on July 18, 2023 7:28 PM GMT

See also their Llama 2 website here: https://ai.meta.com/llama, and their research paper here: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/ 

From their blog post:

Takeaways

  • Today, we’re introducing the availability of Llama 2, the next generation of our open source large language model. 
  • Llama 2 is free for research and commercial use. 
  • Microsoft and Meta are expanding their longstanding partnership, with Microsoft as the preferred partner for Llama 2.
  • We’re opening access to Llama 2 with the support of a broad set of companies and people across tech, academia, and policy who also believe in an open innovation approach to today’s AI technologies.

Compared to the first Llama, Llama 2 is trained on 2T tokens instead of 1.4T, has 2x the context length (4096 instead of 2048), uses Grouped Query Attention, and performs better across the board, with performance generally exceeding code-davinci-002 on the reported benchmarks.
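
For readers who haven't seen it, Grouped Query Attention (GQA) lets several query heads share a single key/value head, which shrinks the KV cache at inference time. Below is a minimal PyTorch sketch of the idea; the head counts and shapes are illustrative, not Llama 2's actual configuration (the paper applies GQA only to its larger models, if I remember correctly).

```python
# Minimal sketch of grouped-query attention (GQA). Shapes and head counts are
# illustrative only, not Llama 2's real configuration.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: (batch, n_query_heads, seq, head_dim)
    # k, v: (batch, n_kv_heads, seq, head_dim), with n_kv_heads dividing n_query_heads
    n_query_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_query_heads // n_kv_heads
    # Each group of query heads shares a single key/value head.
    k = k.repeat_interleave(group_size, dim=1)  # -> (batch, n_query_heads, seq, head_dim)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

batch, seq, head_dim = 2, 16, 64
q = torch.randn(batch, 8, seq, head_dim)  # 8 query heads
k = torch.randn(batch, 2, seq, head_dim)  # only 2 key/value heads
v = torch.randn(batch, 2, seq, head_dim)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([2, 8, 16, 64])
```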

They also release both a normal base model (Llama 2) and a RLHF'ed chat model (Llama 2-chat). Interestingly, they're only releasing the 7B/13B/70B models, and not the 34B model, "due to a lack of time to sufficiently red team". 

More importantly, they're releasing it both on Microsoft Azure and making it available for commercial use. The form for requesting access is very straightforward and does not require stating what you're using it for. (EDIT: they gave me access ~20 minutes after submitting the form; seems pretty straightforward.)
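
For reference, here's a minimal sketch of what using the released weights looks like via Hugging Face transformers, assuming your account has been granted access to the gated repo; the meta-llama/Llama-2-7b-chat-hf name is the repo id used at release, so treat it as an assumption if things have moved since.

```python
# Minimal sketch: run `huggingface-cli login` first so the gated meta-llama
# repo can be downloaded (the repo id below is an assumption based on the
# names used at release).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Summarize the Llama 2 license restrictions in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```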

Note that their license is not technically always free for commercial use; it contains the following clauses:

[1.] v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).
 

2. Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

See also the Llama 2 Acceptable Use Policy (which seems pretty standard). 


