Normal view

There are new articles available, click to refresh the page.

Before yesterdayAI Alignment Forum

AXRP Episode 31 - Singular Learning Theory with Daniel MurfetDanielFilan
Published on May 7, 2024 3:50 AM GMTYouTube link What’s going on with deep learning? What sorts of models get learned, and what are the learning dynamics? Singular learning theory is a theory of Bayesian statistics broad enough in scope to encompass deep neural networks that may help answer these questions. In this episode, I speak with Daniel Murfet about this research program and what it tells us. Topics we discuss: What is singular learning theory? Phase transitions Estimating the local
7 May 2024 at 11:50

AXRP Episode 31 - Singular Learning Theory with Daniel Murfet

Published on May 7, 2024 3:50 AM GMT

What’s going on with deep learning? What sorts of models get learned, and what are the learning dynamics? Singular learning theory is a theory of Bayesian statistics broad enough in scope to encompass deep neural networks that may help answer these questions. In this episode, I speak with Daniel Murfet about this research program and what it tells us.

Topics we discuss:

In this transcript, to improve readability, first names are omitted from speaker tags.

Filan: Hello, everybody. In this episode, I’ll be speaking with Daniel Murfet, a researcher at the University of Melbourne studying singular learning theory. For links to what we’re discussing, you can check the description of this episode and you can read the transcripts at axrp.net. All right, well, welcome to AXRP.

Murfet: Yeah, thanks a lot.

What is singular learning theory?

Filan: Cool. So I guess we’re going to be talking about singular learning theory a lot during this podcast. So, what is singular learning theory?

Murfet: Singular learning theory is a subject in mathematics. You could think of it as a mathematical theory of Bayesian statistics that’s sufficiently general with sufficiently weak hypotheses to actually say non-trivial things about neural networks, which has been a problem for some approaches that you might call classical statistical learning theory. This is a subject that’s been developed by a Japanese mathematician, Sumio Watanabe, and his students and collaborators over the last 20 years. And we have been looking at it for three or four years now and trying to see what it can say about deep learning in the first instance and, more recently, alignment.

Filan: Sure. So what’s the difference between singular learning theory and classical statistical learning theory that makes it more relevant to deep learning?

Murfet: The “singular” in singular learning theory refers to a certain property of the class of models. In statistical learning theory, you typically have several mathematical objects involved. One would be a space of parameters, and then for each parameter you have a probability distribution, the model, over some other space, and you have a true distribution, which you’re attempting to model with that pair of parameters and models.

And in regular statistical learning theory, you have some important hypotheses. Those hypotheses are, firstly, that the map from parameters to models is injective, and secondly (quite similarly, but a little bit distinct technically) is that if you vary the parameter infinitesimally, the probability distribution it parameterizes also changes. This is technically the non-degeneracy of the Fisher information metric. But together these two conditions basically say that changing the parameter changes the distribution changes the model.

And so those two conditions together are in many of the major theorems that you’ll see when you learn statistics, things like the Cramér-Rao bound, many other things; asymptotic normality, which describes the fact that as you take more samples, your model tends to concentrate in a way that looks like a Gaussian distribution around the most likely parameter. So these are sort of basic ingredients in understanding how learning works in these kinds of parameterized models. But those hypotheses do not hold, it’s quite easy to see, for neural networks. I can go into more about why that is.

So the theorems just don’t hold. Now, you can attempt to make use of some of these ideas anyway, but if you want a thoroughgoing, deep theory that is Bayesian and describes the Bayesian learning process for neural networks, then you have to be proving theorems in the generality that singular learning theory is. So the “singular” refers to the breaking of these hypotheses. So the fact that the map from parameters to models is not injective, that means, in combination with this other statement about the Fisher information metric, that if you start at a neural network parameter, then there will always be directions you can vary that parameter without changing the input/output behavior, without changing the model.

Some of those directions are kind of boring, some of them are interesting, but that’s what singular learning theory is about: accommodating that phenomenon within the space of neural networks.

Filan: The way I’d understood it is that this basically comes down to symmetries in the neural network landscape. You can maybe scale down this neuron and scale up this neuron, and if neurons are the same, it doesn’t matter. But not only are there symmetries, there are non-generic symmetries.

Murfet: Correct. Yeah.

Filan: Because if there were just some symmetries, then maybe you could mod out by the symmetries… If you looked at the normal direction to the space at which you could vary things, then maybe that would be fine. So the way I’ve understood it is that there are certain parameter settings for neural networks where you can change it one way or you can change it another way, but you can’t change it in both directions at once. And there are other parameter settings where you can only change it in one of those two ways. So the fact that you can’t do them both at once means it’s not a nice, smooth manifold. And the fact that it’s different at different places means that it’s not this generic thing over the whole space. Some models are more symmetric than others and that ends up mattering.

Murfet: Yeah, I would say that’s mostly correct. I would say the word ‘symmetry’ is really not… I think I would also at a high level maybe use this word to a first approximation in explaining what’s going on, but it’s really not a sufficient concept. But yeah, it’s good to distinguish the kind of boring generic symmetries that come from the non-linearities. So in some sense, that’s why you can just look at a neural network and know that it’s singular because of these symmetries, like the ReLU scaling up the input and scaling down the output weights respectively will not change the behavior of the network. So that’s an obvious scaling symmetry, and that means that it’s degenerate and therefore a singular model.

But if that was all there was, then I agree: somehow that’s a boring technical thing that doesn’t seem like you really need, from a point of view of understanding the real phenomena, to care about it too much. But the reason that SLT [singular learning theory] is interesting is that, as you say, different regions of parameter space, you could say have different kinds of symmetries as a reflection of the different ways qualitatively in which they’re attempting to model the true distribution. But this other thing you mentioned, about being able to move in different directions, that’s not really symmetry so much as degeneracy.

So we could go more into conceptually why different regions or different kinds of solutions might have different kinds of degeneracy, but at a high level that’s right. Different kinds of solutions have different kinds of degeneracy, and so being able to talk about different kinds of degeneracy and how they trade off against one another, and why Bayesian learning might prefer more or less degenerate kinds of models, is the heart of SLT.

Filan: Sure. Before we go into that, what do you mean by “degeneracy”?

Murfet: Degeneracy just refers to this failure of the map from parameters to models to be injective. So “a degeneracy” would just mean a particular kind of way in which you could vary the neural network parameter, say in such a way that the input/output map doesn’t change. And as you were just mentioning, you might have, at one point, two or more essentially different ways in which you could vary the parameter without changing the loss function. And that is by definition what geometry is. So what I’m describing there with my hand is the level set of the loss function. It might be the minimal level set or some other level set, but if we’re talking about multiple ways I can change the neural network parameter without changing the loss, then I’m describing the configuration of different pieces of the level set of the loss function at that point. And that’s what geometry is about.

Filan: Sure. You mentioned that singular learning theory, or SLT for short, is very interested in different kinds of degeneracies. Can you tell us a little bit [about] what are the kinds of degeneracies, what different kinds of degeneracies might we see maybe in deep learning? And why does the difference matter?

Murfet: I think it’s easier to start with a case that isn’t deep learning, if that’s all right. Deep learning jumps straight into the deep end in terms of… and it’s also the thing which we understand least, perhaps. But if you imagine the easiest kind of loss functions… and when I say loss function, I typically mean “population loss”, not the empirical loss from a fixed dataset of finite size, but the average of that of all datasets. So that’s somehow the theoretical object whose geometry matters here, so I’ll flag that, and there’s some interesting subtleties there.

So in a typical case, in a regular statistical setting - not neural networks, but linear regression or something - the population loss looks like a sum of squares, so just a quadratic form. And there, minimizing it - I mean maybe with some coefficients, the level sets are ellipses - then the learning process just looks like moving down that potential well to the global minimum. And that’s kind of all that’s happening. So in that case, there’s no degeneracy. So there’s just one global minimum and you can’t vary it at all and still have zero loss.

A more interesting case would be where: suppose you have 10 variables, but a sum of eight squares, so x1² through x8². And then if you minimize that, well, you’ve still got two free parameters, so there’s a two-dimensional space of global minima of that function. Now imagine a population loss, and let’s only care about local minima, which has many local minima at various heights of the loss, each of which use different numbers of variables. So we suppose, for instance, that the global minimum maybe uses all 10, but then there’s a level set a bit higher than that that uses only nine squares, and a level set a bit higher than that that uses only eight squares. And so then those have different amounts of degeneracy.

So you have different points in the parameter space, loss landscape, where local minima have different degrees of degeneracy. And so then you can think about the competition between them in terms of trading off between preference for degeneracy versus preference for loss. And then we’re getting into key questions of if you’re a Bayesian, what kind of solution you prefer in terms of accuracy versus degeneracy.

Filan: And I guess this gets to this object that people talk about in singular learning theory called the “learning coefficient”. Can you tell us a little bit about what the learning coefficient is?

Murfet: In the case I was just describing, it’s easy to say what the learning coefficient is. There’s a distinction between global learning coefficient… Everything I say about SLT, more or less, is material that was introduced by Watanabe and written about in his books, and at some point, I guess we’ll talk about our contributions more recently. But mostly what I’m describing is not my own work, just to be clear.

So I’ll mostly talk about the local learning coefficient, which is a measure of degeneracy near a point in parameter space. If I take this example I was just sketching out: you imagine the global minimum level set and then some higher level sets. And I said that the population loss near the global minimum looked like a sum of 10 squares. And so the local learning coefficient of that would just be 10/2, so a half times the number of squares that you used.

So if there was a level set that had used only eight squares, then that’s degenerate, because you have two free directions, so it’s not a single isolated minimum, but rather a sort of two-dimensional plane of minimum. And each point of that two-dimensional plane would, because it locally looks like a sum of eight squares, have 8/2 as its local learning coefficient and so on. So if you use D’ squares in the local expression of your population loss, then your local learning coefficient is D’/2. That’s not how it’s defined: it has a definition, which we could get into various different ways of looking at it, but that’s what it cashes out to in those examples.

Filan: Sure. I guess the way to think about this local learning coefficient is that when it’s lower, that’s a solution that’s more degenerate. And the way I gather Bayesian inference works is that it tries to have both a low loss and also a low local learning coefficient. Does that sound right?

Murfet: Yep.

Filan: An image I often see in discussions of singular learning theory is people drawing doodles of trefoils and figure eights and maybe a circle to throw in there. The thing I often hear (as a caricature) is: initially you stay around the trefoil for a while, this is where you put your posterior mass, until at some point you get enough data and then you start preferring this figure eight, and then you get even more data and then you start preferring this circle, which has maybe even lower loss. So as you go down, maybe you get better loss, let’s just say, but the local learning coefficient is going to increase and therefore get worse.

Murfet: Maybe I’ll caveat that a little: the local learning coefficient is increasing, so you’re accepting a more complex solution in exchange for it being more accurate.

Phase transitions

Filan: Yeah. So that’s the very basic idea of singular learning theory. Why does it matter? What are the important differences between the singular learning theory picture and the classical statistical learning theory picture?

Murfet: In what context? Statistical learning theory in general, deep learning theory, alignment, or all three in that order?

Filan: Maybe all three in that order. I think I want to put off the discussion of alignment relevance for a little bit later until we just understand what’s going on with this whole thing.

Murfet: Okay. Yeah, I guess I didn’t actually come back to your question about the local learning coefficient in neural networks from earlier, but I think the cartoon in terms of sums of squares might still suffice for the moment.

If we talk about statistical learning theory in machine learning or deep learning in general, I think the main high-level conceptual takeaway from singular learning theory when you first encounter it should be that the learning process in Bayesian statistics really is very different for singular models. So let me define what I mean by “learning process”.

When we say “learning process” in deep learning, we tend to mean training by stochastic gradient descent. And what I’m saying is maybe related to that, but that’s a tricky point, so let me be clear that in Bayesian statistics, the “learning process” refers to: as you see more data, you change your opinion about what the relative likelihood of different parameters is. So you see more data, some parameters become ruled out by that data because they don’t give that data high probability, whereas other parameters become more likely. And what I’m describing is the Bayesian posterior, which assigns a probability to each parameter according to the data.

And so as you see more samples… I mean, if you’ve seen very few samples, you really have no idea which parameters are correct, so the posterior is very diffuse and will change a lot as you see more samples because you just are very ignorant. But asymptotic normality and regular statistical learning theory says that as you see more samples, that process starts to become more regular and concentrate around the true parameter in a way that looks like a Gaussian distribution.

So that’s in some sense a very simple process. But in singular models, that is not what happens, at least that’s not what’s predicted to happen by the theory. Until relatively recently, I think we didn’t have many very compelling examples of this in practice. But what the theory says is what you were describing earlier, that the Bayesian posterior should kind of jump as the trade-off between accuracy and complexity changes, which is a function of the number of samples. And those jumps move you from regions of qualitatively different solutions to other kinds of solutions, and then eventually maybe asymptotically to even choosing among perfect solutions depending on their complexity and then so on.

So there’s a very complicated, not very well-understood process underlying learning in Bayesian statistics for singular models, which as far as I know, Watanabe and his collaborators are the only people to ever really study. This is despite being somewhat old, in the sense that Watanabe and students and collaborators have been working on it for a while; it’s really not been studied in great depth outside of their group.

So [it’s] a very fundamental process in Bayesian statistics, relatively understudied, but arguably, at least if you take a Bayesian perspective, very central to how learning works in (say) neural networks, whether they’re artificial ones or even possibly biological ones.

So I think that’s the main thing. I mean, that’s not the only thing singular learning theory talks about. It’s not the only theoretical content, but I would say that’s the main thing I would want someone to know about the theory as it stands right now. The other thing is how that relates to generalization, but maybe I’ll pause there.

Filan: Sure. Maybe we should talk about that a bit. I hear people talk about this with the language of phase transitions. And I think upon hearing this, people might say, “Okay, if you look at loss curves of big neural nets that are being trained on language model data, the loss kind of goes down over time, and it doesn’t appear to be stuck at one level and then suddenly jump down to another level and then be flat and then suddenly jump down.” We have things which kind of look like that in toy settings, like grokking, like the development of induction heads, but it doesn’t generically happen. So should we think of these phase transitions as being relevant to actual deep learning, or are they just a theoretical curiosity about the Bayesian theory?

Murfet: Yeah, I think that’s a very reasonable question. I think a year ago, we ourselves were skeptical on this front. I think even in toy settings it wasn’t very clear that this theoretical prediction bears out. So maybe I’ll spend a moment to just be quite precise about the relationship between theory and practice in this particular place.

What the theory says is: asymptotically in N, the number of samples, a certain formula describing the posterior works, and then based on this formula, you can have the expectation that phase transitions happen. But in principle, you don’t know lower-order terms in the asymptotic, and there could be all sorts of shenanigans going on that mean that this phenomenon doesn’t actually occur in real systems, even toy ones. So theory on its own - I mean in physics or in machine learning or whatever - has its limits, because you can’t understand every ingredient in an asymptotic expansion. So even in toy settings, it was reasonable, I think, to have some skepticism about how common this phenomenon was or how important it was, even if the theory is quite beautiful.

Okay, so that aside, you go and you look in toy systems and you see this behavior, as we did, and then I think it’s reasonable to ask, “Well, okay, so maybe this happens in small systems, but not in large systems?” And indeed in learning curves, we don’t think we see a lot of structure.

So I’ll tell you what we know, and then what I think is going on. I should preface this by saying that actually we don’t know the answer to this question. So I think that it still remains unclear if this prediction about phases and phase transitions is actually relevant to very large models. We’re not certain about that. I would say there’s a reasonable case for thinking it is the case that it is relevant, but I want to be clear about what we know and don’t know.

Again, this is kind of an empirical question, because the theoretical situation under which phases and phase transitions exist… the theory stops at some point and doesn’t say much at the moment about this scale or that scale.

So what we know is that if you look at transformers around the scale of three million parameters, trained on language model datasets, you do see something like phases and phase transitions that basically describe… So again, what I’m about to describe is the learning process of training rather than seeing more samples. But the theoretical jump that we’re making here is to say, okay, if Bayesian statistics says certain kinds of structures in the model - if the theory says there should be qualitative changes in the nature of the way the posterior is describing which models are probable, if there are qualitative changes in that over the course of the Bayesian learning process, as you see more samples, then you might expect something similar when you go and look at seeing cumulatively more examples through the training process of stochastic gradient descent. But that is not a theoretically justified step at this point in some rigorous sense. That’s the kind of prediction you might make assuming some similarity between the learning processes, and then you can go in empirically and see if it’s true.

So if you go and look at language models at the scale of three million parameters… This is a recent paper that we did, Developmental Landscape of In-Context Learning. If you go and look at that, what you see [is] that the training process is divided into four or five stages, which have different qualitative content in a way that isn’t visible in the loss curve mostly.

Filan: It is a little bit visible.

Murfet: Yeah, I would agree with that. I mean, to the same extent that the induction bump is sort of visible in the original in-context learning and induction heads paper.

Filan: Yeah. I mean, it’s not obvious from the loss curve. It’s not like everybody already knew all the things that you guys found out.

Murfet: Yeah, I would say that without these other results, if you looked at the loss curve and tried to tell the story about these little bumps, it would feel like tea leaf reading. But once you know that the stages are there, yes, you can look at the loss curve and sort of believe in certain features of them.

So I mean, there’s various details about how you think about the relationship between those stages and phases and phase transitions in a sense of SLT. But I would say that’s still a very small model, but not a toy model, in which you do see something like stage-wise development.

And there are independent reasons… People have independently been talking about stage-wise development in learning systems outside of SLT. So I would say that the SLT story and stage-wise development as a general framing for how structure arrives inside self-organizing learning processes, that dovetails pretty well. So I would say that, to come back to your question about structure in the loss curve, just because nothing’s happening in the loss curve doesn’t mean that there isn’t structure arriving in stages within a model. And our preliminary results on GPT-2 Small at 160 million parameters: at a high level it has stages that look pretty similar to the ones in the three million parameters.

Filan: Interesting.

Murfet: So here’s my guess for what’s going on. It’s true that in very large models, the system is learning many things simultaneously, so you won’t see very sharp transitions except possibly if they’re very global things: [e.g.] switching to in-context learning as a mode of learning seems like it affects most of the things that a system is learning, so a qualitative change at that scale, maybe you would guess actually is represented sort of at the highest level and might even be visible in the loss curve, in the sense that everything is coordinated around that. There’s before and after.

But many other structures you might learn, while they’re developing somewhere else in the model, it’s memorizing the names of U.S. presidents or something, which just has nothing to do with structure X, Y, Z. And so in some sense, the loss curve can’t possibly hit a plateau, because even if it’s hitting a critical point for these other structures X, Y, Z, it’s steadily making progress memorizing the U.S. presidents. So there can’t be clear plateaus.

So the hypothesis has to be something like: if there is stage-wise development, which is reflected by these phases and phase transitions, it’s in some sense or another localized, maybe localized to subsets of the weights and maybe localized in some sense to certain parts of the data distribution. So the global phases or phase changes which touch every part of the model and affect every kind of input are probably relatively rare, but that isn’t the only kind of phase, phase transition, stage to which Bayesian statistics or SLT could apply.

Filan: Sure. Should I imagine these as being sort of singularities in a subspace of the model parameter space? The learning coefficient kind of picks them out in this subspace, but maybe not in the whole parameter space?

Murfet: Yeah, that’s kind of what we’re thinking. These questions are pushing into areas that we don’t understand, I would say. So I can speculate, but I want to be clear that some parts of this we’re rather certain of: the mathematical theory is very solid, the observation of the correspondence between the theory and Bayesian phase transitions in toy models is empirically and theoretically quite solid. This question of what’s happening in very large systems is a deep and difficult question. I mean, these are hard questions, but I think that’s right, that’s the motivation for… One of the things we’re currently doing is what we call weight-restricted local learning coefficients. This basically means you take one part of the model, say, a particular head, you freeze all the other weights…

Let me just give a more formal setting. When we’re talking about the posterior and the local learning coefficient and so on, we imagine a space of parameters. So there’s D dimensions or something. Some of those directions in parameter space belong to a particular head, and I want to take a parameter that, at some point in training, has some values for all these heads, I mean, for all these different weights, and I want to freeze all but the ones in the head and then treat that as a new model. Now, my model is I’m not allowed to change those weights, but I’m allowed to change the weights involved in the head, and I can think about the Bayesian posterior for that model and I can talk about its local learning coefficient.

That involves perturbing the parameter nearby that particular coefficient, but in a way where you only perturb the weights involved in that part of the structure, say, that head, and you can define the complexity of that local learning coefficient. That’s what we call the weight-restricted local learning coefficient. And then the hypothesis would be that, if a particular part of the model is specializing in particular kinds of structure and that structure is developing, then you’ll be at a critical point for some kind of restricted loss that is referring only to those weights, and that would show up.

We haven’t talked about how the local learning coefficient is used to talk about phase transitions, but that’s the experimental way in which you’d attempt to probe whether some part of the model is doing something interesting, undergoing a phase transition separately from other parts of the model.

Filan: Yeah, actually, maybe we should clarify that. How do you use the learning coefficient to figure out if a phase transition is happening?

Murfet: It depends on your background which answer to this question is most pleasant. For physics-y people who know about free energy, they’re familiar with the idea that various derivatives of the free energy should do something discontinuous at a phase transition, and you can think about the local learning coefficient as being something like that. So that, if there is a phase transition, then you might expect this number to change rapidly relative to the way it usually changes.

If we just stick within a statistical learning theory frame, we were laying out this picture earlier of: as you see more samples, the Bayesian posterior is concentrated in some region of parameter space and then rapidly shifts to be concentrated somewhere else, and the local learning coefficient is a statistic of samples from the Bayesian posterior, so if the Bayesian posterior shifts, then this number will also shift. The expectation would be that, if you measure this number, which it turns out you can do from many experiments, if you see that number change in some significant way, then it is perhaps evidence that some qualitative change in the posterior has occurred. That’s a way of detecting phase transitions which is, if you take this bridge from Bayesian statistics to statistical physics, pretty well justified I would say.

Estimating the local learning coefficient

Filan: Sure. A question about that: my understanding is that trying to actually measure the local learning coefficient involves taking a parameter setting and looking at a bunch of parameter settings nearby on all these dimensions that you could vary it, and measuring a bunch of properties, and this is the kind of thing that’s easy to do when you have a very low-dimensional parameter space corresponding to a small number of parameters. It seems like it’s going to be harder to do with a higher number of parameters in your neural networks. Just practically, how large a model can you efficiently measure local learning coefficient [for] at this time?

Murfet: Yeah. That’s a good question. I think it’s tricky. Maybe this will be a bit of an extended answer, but I think it’ll be better if I provide some context. When we first started looking at SLT, myself and my colleague here at the University of Melbourne, Susan Wei, and some other people… This was before… believe it or not, today there are 10x the number of people interested in SLT than there were back when we started thinking about it. It was an extremely niche subject, very deep and beautiful, but somewhat neglected.

Our question at that time was exactly this question. The theory says the local learning coefficient - the “real log canonical threshold” is another mathematical name for it - the theory says this is a very interesting invariant, but it’s very unclear if you can accurately estimate it in larger models. A lot of the theoretical development [involved using] one PhD student to compute the RLCT of one model theoretically, and you need some hardcore algebraic geometry to do that, et cetera, et cetera. The way the subject sat, it wasn’t clear that you could really be doing this at scale because it seems to depend on having very accurate samples from the posterior via Markov Chain Monte Carlo sampling or something.

I admit, I was actually extremely pessimistic when we first started looking at it that there really would be a future in which we’d be estimating RLCTs, or local learning coefficients, of a hundred million parameter models. So that’s where I started from. My colleague Susan and my PhD student Edmund Lau decided to try SGLD, stochastic gradient Langevin dynamics, which is an approximate Bayesian sampling procedure based on using gradients, to see how it worked. There’s a step in estimating the local learning coefficient where you need samples from the posterior. As you’re describing, this is famously difficult for large dimensional complex models.

However, there is a possible loophole, which is that… I mean, I don’t believe that anybody has a technique, nor probably ever will, for understanding or modeling very accurately the Bayesian posterior of very large-scale models like neural networks. I don’t think this is within scope, and I’m skeptical of anybody who pretends to have a method for doing that, hence why I was pessimistic about estimating the LLC [local learning coefficient] at scale because it’s an invariant of the Bayesian posterior which seems to have a lot of information about it and I believe it’s hard to acquire that information. The potential loophole is that maybe the local learning coefficient relies on relatively robust signals in the Bayesian posterior that are comparatively easy to extract compared to knowing all the structure.

That seems to be the world that we are in. To answer your question, Zach Furman and Edmund Lau just recently had a pre-print out where, using SGLD, it seems you can get relatively accurate estimates for the local learning coefficient for deep linear networks: products of matrices and known nonlinearities at scales up to a hundred million parameters.

Filan: A hundred million with an M?

Murfet: With an M, yeah. One should caveat that in several ways, but yeah.

Filan: Okay, and am I right that this is distinct from the “Quantifying degeneracy with the local learning coefficient” paper?

Murfet: That’s right. This is a second paper, a followup to that. I forget the title. I think it’s Estimating Local Learning Coefficient at Scale. So we wrote that paper a couple of years ago now, I think, looking at defining the local learning coefficient - which is implicit in Watanabe’s work, but we made it explicit - and making the observation that you could use approximate sampling to estimate it and then studying that in some simple settings, but it remained very unclear how accurate that was in larger models.

Now, the reason it’s difficult to go and test that is because we don’t know the true local learning coefficient for very many models that can be increased in some direction of scale. We know it for one hidden layer tanh networks and things like that. But some recent, very deep, interesting work by Professor Miki Aoyagi gives us the true value of the local learning coefficient for deep linear networks, which is why Zach and Edmund studied those. This was an opportunity to see if SGLD is garbage or not for this purpose.

I should flag that despite… How should I say this? SGLD is a very well-known technique for approximate Bayesian posterior sampling. I think everybody understands that you should be skeptical of how good those posterior samples are in some sense. It might be useful for some purpose, but you shouldn’t really view it as a universal solvent for your Bayesian posterior sampling needs or something. Just using SGLD doesn’t magically mean it’s going to work, so I would view it as quite surprising to me that it actually gives accurate estimates at scale for deep linear networks.

Now, having said that, deep linear networks are very special, and they are less degenerate in some important ways than real neural networks with nonlinearities, et cetera, so don’t take me as saying that we know that local learning coefficient estimation gives accurate values of the local learning coefficient for language models or something. We have basically no idea about that, but we know it’s accurate in deep linear networks.

Okay, so then what is generalizable about that observation? I think it leads us to believe that maybe estimating the LLC, SGLD is actually not garbage for that. How good it is we still don’t know, but maybe this cheap posterior sampling is still good enough to get you something interesting. And then the other thing is that: well, what you observe in cases where you know the true values is that, when the model undergoes phase transitions which exist in deep linear networks, as many people have… Maybe not in those exact terms, but, stage-wise development in deep linear networks has been studied for quite a long time, and you can see that this local learning coefficient estimator which is measuring the complexity of the current parameter during the learning process does jump in the way you would expect in a phase transition, when deep linear networks go through these phase transitions.

Well, it had to, because we know theoretically what’s happening to the geometry there. Those jumps in the local learning coefficient in other models, like these 3 million parameter language models or GPT-2 Small… when you go and estimate the local learning coefficient, you see it change in ways that are indicative of changes in internal structure. Now, we don’t know that the absolute values are correct when we do that, and most likely they’re not, but I think we believe in the changes in the local learning coefficient reflecting something real to a greater degree than we believe in the absolute values being real. Still, theoretically, I don’t know how we would ever get to a point where we would know the local learning coefficient estimation was accurate in larger models absent really fundamental theoretical improvements that I don’t see coming in the near term, but that’s where we are at the moment.

Singular learning theory and generalization

Filan: Fair enough. A while back, you mentioned the contributions of singular learning theory to understanding deep learning. There was something to do with phase transitions and there was also something to do with generalization, I think you mentioned. I want to ask you about that. Especially in the context of: I sometimes hear people say, “Oh, statistical learning theory says that model classes can have these parameters that have some degeneracy and that basically reduces their effective parameter count, and this just explains how generalization is possible.” This is the kind of story one can tell when one feels excitable, but it’s a bit more complicated. It’s going to depend on details of how these parameters actually translate into functions and what these degeneracies actually look like in terms of predictive models. What does singular learning theory tell us about generalization, particularly in the context of deep networks?

Murfet: Yeah. This is subtle. On its face, [in] singular learning theory, the theorems describe relations between loss, local landscape geometry, this local learning coefficient, and generalization error in the Bayesian sense. In the Bayesian sense, what I mean by generalization error is the KL divergence between the true distribution and the predictive distribution.

Maybe I should say briefly what the latter is. If you’re trying to make a prediction, if you’re talking about a conditional distribution, a prediction of Y given X, and you look at all the parameters that you’ve got for modeling that relationship, and you’re given an input and you take the prediction from every single model parameterized by your parameter space, you weight it with the probability given to that particular model by the Bayesian posterior and you average them all in that way, that’s the Bayesian predictive distribution. [It’s] obviously radically intractable to use that object or find that object. It’s a theoretical object. That probability distribution is probably not one that’s parameterized by parameters in your parameter space, but you can cook it up out of models in your parameter space. The KL divergence between that and the truth is the Bayesian generalization error.

Filan: The KL divergence just being a measure of how different probability distributions are.

Murfet: Right. That seems like a very theoretical object. There’s a closely related object, the Gibbs generalization error, which puts some expectations in different orders which is closer to what people in machine learning mean by “test error” - taking a parameter and trying it out on some samples from the true distribution that weren’t used to produce that parameter. There’s the various subtleties there. SLT, strictly speaking, only says things about those kinds of generalization errors and the relationship between that and test error for a parameter produced by a single run of SGD - well, I don’t even know that that is a mathematical object actually (test error for a parameter after a single run), but you can do things like talk about, for some distribution of SGD runs, what’s the expected test error.

Then there’s a gap between that Bayesian story and what you mean by “test error” in deep learning. This gap hasn’t been very systematically addressed, but I’ll lay out some story about how you might bridge that eventually in order to answer your question. If you believe that the Bayesian learning process ends with a distribution of parameters that look something like the endpoints of SGD training, or at least close enough, that something like this average of SGD runs of the test error looks a bit like averaging over things in the Bayesian posterior of some generalization quantity that makes sense in the Bayesian theory, then you could maybe draw some connection between these two things.

That hasn’t been done. I don’t know if that’s true, because these questions about relations between the Bayesian posterior and SGD are very tricky and I don’t think they look like they’re going to get solved soon, at least in my opinion. There’s a gap there. That’s one gap. We just paper over that gap and just say, “Okay. Well, fine, let’s accept that for the moment and just treat the generalization error that SLT says things about as being the kind of generalization error that we care about. What does SLT say?”

Maybe I’ll insert one more comment about that relationship between test error in deep learning and Bayesian generalization error first. This is a bit of a tangent, but I think it’s important to insert here. Various people, when looking to explain the inductive bias of stochastic gradient descent, have hit upon a phenomenon that happens in deep linear networks and similar systems, which is a stage-wise learning where the model moves through complexity in an increasing way.

We think about in deep linear networks - or what’s sometimes called matrix factorization, where you’re trying to use a product of matrices to model a single linear transformation - people have observed that, if you start with a small initialization, the model starts with low rank approximations to the true linear transformation and then finds a pretty good low rank approximation and then takes a step to try and use linear transformations of one higher rank and so on, and moves through the ranks in order to try and discover a good model. Now, if you believe that, then you would believe that, if SGD training is doing that, then it will tend to find the simplest solution that explains the data, because it’s searching them starting with simpler ones and only going to more complicated ones when it needs to.

Now, theoretically, that’s only known to happen… I mean, I think it’s not known to happen in deep linear networks rigorously speaking, but there’s expectations of that, [and] empirically, that happens, and there’s some partial theory. Then it’s a big leap to believe that for general SGD training of general neural networks, so I think we really don’t know that that’s the case in general deep learning. Believing that is pretty similar to believing something about the Bayesian learning process moving through regions of parameter space in order of increasing complexity as measured by the local learning coefficient. In fact, that is exactly what’s happening in the deep linear networks.

The SLT story about moving through the parameter space and the Bayesian posterior undergoing phase transitions is exactly what’s happening in the deep linear networks. If you’re willing to buy that generalization from that corner of theory of deep learning to general behavior of neural networks, then I think you are in some sense already buying the SLT story to some degree, [the story] of how learning is structured by looking for increasingly complex solutions. All of those are big question marks from a theoretical point of view, I would say.

Putting that aside, what does SLT say about generalization? Well, it says that the asymptotic behavior of the generalization error as a function of the number of samples at the very end of training, let’s say, or the very end of the Bayesian learning process, looks like the irreducible loss plus a term that looks like lambda/N, where lambda is the local learning coefficient. If you take that irreducible loss over the other side, the difference between generalization error and its minimum value behaves like 1/n, is proportional to 1/n, and the constant of proportionality is the local learning coefficient. That’s the deep role of this geometric invariant, this measure of complexity in the description of generalization error in the Bayesian setting.

Now, what that says in deep learning… as I said, taking that first part of that bridge between the two worlds for granted, it would like to say something like: the test error when you’re looking at a particular region of parameter space is governed by the local learning coefficient, except that the relation between N and training is unclear. The exact way in which it governs test error is a function of how that bridge gets resolved. I think, at a technical level, it’s difficult to say much precise at the moment. I don’t think it’s impossible. It’s just that very few people are working on this and it hasn’t been getting enough attention to say more concrete things.

At a conceptual level, it says that - and this maybe starts to get into more interesting future work you can do taking the SLT perspective - but this relationship between the local learning coefficient and how that is determined by loss landscape geometry and generalization behavior, this is a very interesting link which I think is quite fundamental and interesting.

I think your question is going in the direction of Joar Skalse’s LessWrong post. Is that right?

Filan: That’s what I was inspired by: just this question of, suppose we believe the story of, we’re gradually increasing complexity as measured by the local learning coefficient in this model class: well, what does that actually say in terms of objects that I cared about before I heard of singular learning theory? What’s that telling me in terms of things I care about, of the behavior of these things?

Murfet: It could tell you things like: suppose you know two solutions of your problem that are qualitatively different. You have a data-generating process and you can think about it in two different ways and, therefore, model it in two different ways. Potentially, if you could estimate the local learning coefficient or derive it or have some method of knowing that one is lower than the other, it could tell you things like one will be preferred by the Bayesian posterior.

Now, to the extent that that is related to what SGD finds, that might tell you that training is more likely to prefer some class of solutions to another class. Now, if those parameters are just very different, completely different solutions, somehow not nearby in parameter space, maybe it’s quite difficult to make the bridge between the way the Bayesian posterior would prefer one or the other and what training will do because, in that case, the relationship between training and these two parameters is this very global thing to do with the trajectory of training over large parts of the parameter space, and very difficult perhaps to translate into a Bayesian setting.

In cases where you have two relatively similar solutions, maybe you had a choice to make. So during the training process, you had one of two ways to take the next step and accommodate some additional feature of the true distribution, and those two different choices differed in some complexity fashion that could be measured by the local learning coefficient: one was more complex, but lowered the loss by so much, and the other one was simpler, but didn’t lower the loss quite as much. Then you could make qualitative predictions for what the Bayesian posterior would prefer to do, and then you could ask, “Are those predictions also what SGD does?” Either, theoretically, you could try and find arguments for why that is true, but it [also] gives you an empirical prediction you can go and test.

In this toy model of superposition work we did, SGD training does seem to do the thing that the Bayesian posterior wants to do. That’s very unclear in general, but it gives you pretty reasonable, grounded predictions that you might then go and test, which I think is not nothing. That would be, I think, the most grounded thing you’d do with the current state of things.

Filan: I guess it suggests a research program of trying to understand which kinds of solutions do have a lower learning coefficient, which kinds of solutions have higher learning coefficients, and just giving you a different handle on the problem of understanding what neural network training is going to produce. Does that seem fair?

Murfet: Yeah. I think, [for] a lot of these questions about the relation between the theory and practice, our perspective on them will shift once we get more empirical evidence. What I expect will happen is that these questions seem to loom rather large when we’ve got a lot of theory and not so much empirical evidence. If we go out and study many systems and we see local learning coefficients or restricted local learning coefficients doing various stage-wise things and they correspond very nicely to the structure that’s developing, as we can test independently with other metrics, then I think it will start to seem a little bit academic whether or not it’s provably the case that SGD training does the same thing as the Bayesian posterior just because this tool, which…

To be clear, the local learning coefficient, if you look at the definition, has a sensible interpretation in terms of what’s happening to the loss as you perturb certain weights, and you can tell a story about it, it doesn’t rely on the link between the Bayesian posterior and SGD training or something. To the degree that the empirical work succeeds, I think people will probably take this independent justification, so to speak, of the LLC as a quantity that is interesting, and think about it as a reflection of what’s happening to the internal structure of the model. Then, the mathematicians like myself will still be happy to go off and try and prove these things are justified, but I don’t see this as necessarily being a roadblock to using it quite extensively to study what’s happening during training.

Singular learning theory vs other deep learning theory

Filan: Fair enough. I’d like to ask some questions thinking about SLT as compared to other potential theoretical approaches one could have to deep learning. The first comparison I have is to neural tangent kernel-style approaches. The neural tangent kernel, for listeners who don’t know, is basically this observation that, in the limit of infinitely wide neural networks under a certain method of initializing networks, there’s this observation that networks, during training, the parameters don’t vary very much and, because the parameters don’t vary very much, that means you can do this mathematical trick. It turns out that your learning is basically a type of kernel learning, which is essentially linear regression on a set of features. Luckily, it turns out to be an infinite set of features and you can do it…

I don’t know how I was going to finish that sentence, but it turns out to be feature learning on this set of features, and you can figure out what those features are supposed to be based on what your model looks like, what kinds of nonlinearities you’re using. There’s some family of theory trying to understand: what does the neural tangent kernel of various types of models look like, how close are we to the neural tangent kernel?

And if you believe in the neural tangent kernel story, you can talk about: the reason that neural networks generalize is that the neural tangent kernel, it tends to learn certain kinds of features before other kinds of features, and maybe those kinds of features are simpler. It seems plausible that you could do some story about phase transitions, and it’s a mathematically rigorous story. So I’m wondering, how do you think the single learning theory approach of understanding deep learning compares to the neural tangent kernel-style approach?

Murfet: Yeah, good question. I think I’m not an expert enough on the NTK [neural tangent kernel] to give a very thorough comparison, but I’ll do my best. Let me say first the places in which I understand that the NTK says very deep and interesting things. It seems that this work on the mu parametrization seems very successful. At initialization, when this “taking the limit to infinite width” is quite justified because the weights really are independent, this seems like probably the principal success of deep learning theory, to the extent there are any successes: the study of that limit and how it allows you to choose hyperparameters for learning rates and other things. Again, I’m not an expert, but that’s my understanding of how it’s used, and that seems to be quite widely used in practice, as far as I know. So that’s been a great success of theory.

I don’t think I believe in statements outside of that initial phase of learning though. I think there, as far as I understand it, the claims to applicability of the NTK methods become hypotheses, unless you then perturb away from the Gaussian process limit. The deep parts of that literature seem to me to be accepting the position that in the infinite width limit, you get some Gaussian process that isn’t actually a good description of the training process away from initialization, but then you can perturb back in basically higher-order terms in the exponent of some distribution. You can put in higher-order terms and study systematically those terms to get back to finite width, attempt to perturb away from infinite width back to finite width and accommodate those contributions in some fashion. And you can do that with tools from random matrix theory and Gaussian processes.

And that looks a lot like what people do in Euclidean quantum field theory, and so people have been applying techniques from that world to do that. And I think they can say non-trivial things, but I think it is overselling it to say that that is a theory on the same level of mathematical rigor and depth as SLT. So I don’t think it says things about the Bayesian posterior and its asymptotics, in the way that SLT does, I think it’s aiming at rather different statements. And I think, at least in my judgment at the moment, it has a little bit of the flavor of saying qualitative things rather than quantitative things. Again, this is my outsider’s impression, and I could be wrong about what the state of things is there.

But I would say that one part of that story that I have looked at a little bit is the work that my colleague, Liam Hodgkinson has done here. They have some very interesting recent work on information criterion in over-parameterized models - I think the title is something like that. [It’s] partly inspired by Watanabe’s work, I think, looking at trying to take, not only NTK, but this general sort of approach, point of view to doing things like what the free energy formula in SLT does. And so I think that’s quite interesting. I have my differences of opinion with Liam about some aspects of that, but mathematics isn’t actually divided into camps that disagree with one another or something, right?

So if things are both true, then they meet somewhere. And I can easily imagine that… SLT is sort of made up of two pieces, one of which is using resolution of singularities to do Laplace integrals, oscillatory integrals, and the other is dealing with empirical processes that intervene in that when you try to put it in the context of statistics. And I don’t think these kinds of oscillatory integrals, these techniques, have been used systematically by the people doing NTK-like stuff or Euclidean field theory-like stuff, but I think that if you took those techniques and used them in the context of the random matrix theory that’s going on there, you’d probably find that the perturbations that they’re trying to do can be linked up with SLT somewhere. So I mean, I think it all probably fits together eventually, but right now they’re quite separated.

Filan: Fair enough. So a related question I have is: one observation I have, from the little I know about the deep learning theory literature, is the variance of the distribution of how parameters are initialized matters. So one example of this is in deep linear models. If your initialization distribution of parameters has high enough variance, then it looks something like the NTK: you only have a small distance until the optimum. Whereas if all the parameters are really, really close to zero at initialization, you have this jumping between saddle points. And in deep networks at one initialization, you have this neural tangent kernel story, which crucially doesn’t really involve learning features; it has a fixed set of features and you need decide which ones to use. If you differ the variance of the initialization, then you start doing feature learning, and that seems qualitatively different.

If I think about how I would translate that to a singular learning theory story… At least in general, when people talk about Bayesian stories of gradient descent, often people think of the prior as being the initialization distribution. And in the free energy formula of singular learning theory, the place where the loss comes up and then the learning coefficient comes up, the prior comes in at this order one term that matters not very much, basically.

Murfet: Well, late in training… I mean, late in the process it doesn’t matter.

Filan: Yeah. So I guess my question is: is singular learning theory going to have something to say about these initialization distribution effects?

Murfet: I haven’t thought about it at all, so this is really answering this question tabula rasa. I would say that from the asymptotic point of view, I guess we tend not to care about the prior, so this isn’t a question that we tend to think about too much so far, so that’s why I haven’t thought about it. But if you look at our model in the toy model of superposition, where you can really at least try and estimate order N term in the asymptotic, the log N term in the asymptotic, and then these lower order terms… And maybe I should say what this asymptotic is. If you take the Bayesian posterior probability that’s assigned to a region of parameter space and negative its logarithm (that’s an increasing function, so you could basically think about it as telling you how probable a given region is according to the posterior), you can give an asymptotic expansion for that in terms of N.

So for a large N, it looks like N times some number, which is kind of the average loss in that region or something like that, plus the local learning coefficient times log N plus lower order terms. The lower order terms we don’t understand very well, but there’s definitely a constant order term contributed from the integral of the prior over that region. Now if you look at the toy model of superposition, that constant order term is not insignificant at the scale of N at which we’re running our experiments. So it does have an influence, and I could easily imagine that this accounts for the kind of phenomena you’re talking about in DLNs [deep linear networks]. So a mathematician friend of mine, Simon Lehalleur, who’s an algebraic geometer who’s become SLT-pilled, maybe, has been looking at a lot of geometric questions in SLT and was asking me about this at some point.

And I guess I would speculate that if you just incorporated a constant term from those differences in initialization, that would account for this kind of effect. Maybe later in the year, we’ll write a paper about DLNs. At the moment, we don’t have complete understanding of the local learning coefficients away from the global minimum, the local learning coefficients of the level sets. I think we probably are close to understanding them, but there’s a bit of an obstacle to completely answering that question at the moment. But I think principle, that would be incorporated via the constant order term.

Which would, to be clear, not change the behavior at the very large N, but for some significant range of Ns, potentially including the ones you’re typically looking at in experiments, that constant order term could bias some regions against others in a way that explains the differences.

Filan: Yeah. And I guess there’s also a thing where the constant order term, in this case the expansion is: you’ve got this term times N, you’ve got this term times the logarithm of N, you’ve got this term times the logarithm of the logarithm of N, if I remember correctly?

Murfet: Yep.

Filan: And then you have these constant things. And the logarithm of the logarithm of N is very small, right, so it seems like kind of easy for the constant order term to be more important than that, and potentially as important as the logarithm of N?

Murfet: Yeah, although that log log N term is very tricky. So the multiplicity, Aoyagi’s proof… as I said, she understands deep linear networks, and in particular understands the multiplicity of the coefficient of this log log N term up to a -1. And this can get… if I remember correctly, as a function of the depth it has this kind of behavior and it becomes larger and larger [he mimes gradually increasing, ‘bouncing’ curves].

Filan: Like a bouncing behavior with larger bounces?

Murfet: Yeah, that’s right.

Filan: Interesting.

Murfet: Yeah, so that’s very wild and interesting. One of the things Simon is interested in is trying to understand [it] geometrically. Obviously Aoyagi’s proof is a geometric derivation of that quantity, but from a different perspective. Maybe Aoyagi has a very clear conceptual understanding of what this bouncing is about, but I don’t. So anyway, the log log N term remains a bit mysterious, but if you’re not varying the depth and you have a fixed depth, maybe it is indeed the case that the constant order terms could be playing a significant role.

Filan: Sure. Right. So I guess a final question I have before I get into the relationship between singular learning theory and existential risk from AI: I’m more familiar with work done applying singular learning theory to deep learning. Is there much work outside that, of the singular learning theory of all the things people do outside my department?

Murfet: Yes. I mean, that’s where the theory has been concentrated, I would say, so. I don’t want to give the impression that Watanabe didn’t think about neural networks; indeed, the class of models based on neural networks was one of the original motivations for him developing SLT. And he’s been talking about neural networks from the beginning, so early that the state of the art neural networks had tanh nonlinearities, so that’s how long Watanabe’s been talking about neural networks. Watanabe has been 20 years ahead of his time or something. But having said that, deeper neural networks with nonlinearities remain something that we don’t have a lot of theoretical knowledge about. There are some recent results giving upper bounds for various quantities, but in general, we don’t understand deeper neural networks in SLT.

The predominant theoretical work has been done for singular models that are not neural networks, various kinds of matrix factorization. There’s some interesting work by [Piotr] Zwiernik and collaborators looking at various kinds of graphical models, trees, deriving learning coefficients for probabilistic graphical models that have certain kinds of graphs. There’s papers on latent Dirichlet allocation, if that’s the correct expansion of the acronym LDA: many, many papers, dozens, I think. I wouldn’t be able to list all the relevant models here, but there’s quite a rich literature out there over the last several decades looking at other kinds of models.

How singular learning theory hit AI alignment

Filan: All right. So at this stage I’d like to move on to: my experience of singular learning theory is, I’m in this AI existential risk space. For a while, people are chugging along doing their own thing. Then at one Effective Altruism Global, I have this meeting with this guy called Jesse Hoogland who says, “Oh, I’m interested in this weird math theory.” And I tell him, “Oh yeah, that’s nice. Follow your dreams.” And then it seems like at some point in 2023, it’s all everyone’s talking about, singular learning theory, it’s the key to everything, we’re all going to do singular learning theory now, it’s going to be amazing. How did that happen? What’s the story whereby someone doing singular learning theory gets interested in AI alignment or the reverse?

Murfet: Yeah, I guess I can’t speak to the reverse so much, although I can try and channel Alexander [Gietelink Oldenziel] and Jesse [Hoogland] and Stan [van Wingerden] a little bit. I guess I can give a brief runthrough of my story. I cared about SLT before I cared about alignment, so maybe I’ll say briefly why I came to care about SLT. I’m an algebraic geometer by training, so I spent decades thinking about derived categories in algebraic geometry and some mathematical physics of string theory and its intersection with algebraic geometry, et cetera. And then I spent a number of years thinking about linear logic, which might seem unrelated to that, but has some geometric connections as well. And then because of some influence of friends and colleagues at UCLA where I was a postdoc, I paid attention to deep learning when it was taking off again in 2012, 2013, 2014. I’d always been a programmer and interested in computer science in various ways and sort of thought that was cool.

And then I saw AlphaGo happen, and then the original scaling laws paper from Hestness et al.. And it’s when I saw those two, AlphaGo and the Hestness et al. paper, that I was like, “huh, well maybe this isn’t just some interesting engineering thing, but maybe there’s actually some deep scientific content here that I might think about seriously, rather than just spectating on an interesting development somewhere else in the intellectual world.” So I cast around for ways of trying to get my hands on, with the mathematical tools that I had, what was going on in deep learning.

And that’s when I opened up Watanabe’s book, “Algebraic Geometry and Statistical Learning Theory”, which seemed designed to nerd-snipe me, because it was telling me geometry is useful for doing statistics. And then when I first opened it, I thought, that can’t possibly be true, this is some kind of crazy theory. And then I closed the book and put it away and looked at other things, and then came back to it eventually. So that’s my story of getting into SLT, from the point of view of wanting to understand universal mathematical phenomena in large-scale learning machines, and that’s my primary intellectual interest in the story. So I’ve been chugging away at that a little bit.

When I first started looking at SLT, it was - apart from Shaowei Lin, who did his PhD in SLT in the states, I believe, with Bernd Sturmfelds - mostly, it’s Watanabe, his students, and a few collaborators, mostly in Japan, a few people elsewhere, a very small community. So I was sitting here in Melbourne, chugging away reading this book and I had a few students, and then Alexander Oldenziel found me and asked me what this could say about alignment, if anything. And at the time, I found it very difficult to see that there was anything SLT could say about alignment, I guess, because as a mathematician, the parts of the alignment literature that I immediately found comprehensible were things like Vanessa Kosoy’s work or Scott Garrabrant’s work. These made sense to me, but they seemed quite far from statistical learning theory, at least the parts that I understood.

And so I think my answer originally to Alexander was, “no, I don’t think it is useful for alignment”, but reading more about the alignment problem and being already very familiar with capabilities progress, and believing that there was something deep and universal going on that that capabilities progress was sort of latching onto, but it not being some contingent phenomena on having a sequence of very complex engineering ideas, but more like “throw simple scaling and other things at this problem and things will continue to improve”. So that combination of believing in the capabilities progress and more deeply understanding what I was reading in the alignment literature about the problem… the product of that was me taking this problem seriously enough to think that maybe my initial answer, I could profit from thinking a little bit more extensively about it.

So I did that and outlined some of the ideas I had about how this kind of stage-wise learning, or phases and phase transitions that the Bayesian learning process and SLT talks about, how that might be by analogy with developmental biology used to understand how structure develops in neural networks. So I had some preliminary ideas around that [in the] middle of 2023, and those ideas were developed further by Alexander [Oldenziel] and Jesse Hoogland and Stan van Wingerden and various of my students and others, and that’s where this developmental interpretability agenda came from. And I think that’s sort of around the time you ran into SLT, if I remember correctly.

Filan: Yeah. The time I ran into it is: so, I hear a few different people mention it, including, if people listen to the episode of this podcast with Quintin Pope, he brings it up and it sounds interesting. And some other people bring it up, that sounds interesting. And then I hear that you guys are running some sort of summer school thing, a week where you can listen to lectures on single learning theory. And I’m like, “oh, I could take a week off to listen to some lectures, it seems kind of interesting”. This is summer of 2023. These lectures are still up on YouTube, so you can hear some guy ask kind of basic questions - that’s me.

Murfet: Yeah. I guess it took me a while to appreciate some of the things that… I mean, I guess John Wentworth has also been posting in various places how he sees SLT relating to some of the aspects of the alignment problem that he cares about. Now I see more clearly why some of the very core problems in alignment, things like sharp left turns and so on, the way that people conceptualize them… how SLT, when you first hear about it, might map onto that in a way that makes you think it could potentially be interesting.

I think my initial take being negative was mostly to do with it just being such a big gap at that time, the middle of last year, between SLT being a very highly theoretical topic…. I mean, I should be clear. The WBIC, which is the widely applicable Bayesian information criterion, which is a piece of mathematics and statistics that Watanabe developed, has been very widely used in places where the BIC [is used]. This is not an esoteric, weird mathematical object. This is a tool that statisticians use in the real world, as they say. The WBIC has been used in that way as well. And so the work we’ve been doing, with the local learning coefficient and SGLD and so on, is by far not the only place where SLT has met applications. That’s not the case. I don’t want to give that impression.

But the way SLT felt to me at that time was: there’s just so many questions about whether the Bayesian learning process is related to SGD training and all these other things we were discussing. So I think it was quite a speculative proposal to study the development process using these techniques, middle of last year. I think we’ve been hard at work over the last year seeing if a lot of those things pan out, and they seem to. So I think it’s much less speculative now to imagine that SLT says useful things, at least about stage-wise development in neural networks. I think it says more than that about questions of generalization that are alignment-relevant, but I think it was appropriate a year ago to think that there was some road to walk before it was clear that this piece of mathematics was not a nerd-snipe.

Filan: Sure. So at some point, this guy, Alex Oldenziel, reaches out to you and says, “hey, how is single learning theory relevant to alignment?” And instead of deleting that email, you spent some time thinking about it. Why?

Murfet: Well, I should insert a little anecdote here, which is I think I did ignore his first email, not because I read it and thought he was a lunatic, but just because I don’t always get to every email that’s sent to me. He persisted, to his credit.

Filan: Why did it feel interesting to you, or why did you end up pursuing the alignment angle?

Murfet: I had read some of this literature before in a sort of “curious but it’s not my department” kind of way. I quite extensively read Norbert Wiener’s work. I’m a big fan of Wiener, and he’s written extensively, in God & Golem and The Human Use of Human Beings and elsewhere, precisely about the control problem or alignment problem in much the same way as modern authors do. And so I guess I had thought about that and seen that as a pretty serious problem, but not pressing, because AI didn’t work. And then I suppose I came to believe that AI was going to work, in some sense, and held these two beliefs, but in different parts of my brain. And it was Alexander that sort of caused the cognitive dissonance, the resolution of which was me actually thinking more about this problem.

So that’s one aspect of it - just causing me to try and make my beliefs about things coherent. But I think that wouldn’t have been sufficient without a second ingredient, and the second ingredient was: to the degree you assign a probability to something like AGI happening in a relatively short period of time, it has to affect your motivational system for doing long-term fundamental work like mathematics.

So as a kind of personal comment, the reason I do mathematics is not based on some competitive spirit or trying to solve tricky problems or something like that. I am very much motivated as a mathematician by the image of some kind of collective effort of the human species to understand the world. And I’m not [Ed] Witten or [Maxim] Kontsevich or [Alexander] Grothendieck or somebody, but I’ll put my little brick in the wall. And if I don’t do it, then maybe it’ll be decades before somebody does this particular thing. So I’m moving that moment forward in time, and I feel like that’s a valid use of my energies and efforts, and I’ll teach other people and train students to do that kind of thing, and I felt that was a very worthwhile endeavor to spend my life professionally on.

But if you believe that there are going to be systems around in 10 years, 20 years, 30 years - it doesn’t really matter, right, because mathematics is such a long-term endeavor. If you believe that at some time, soon-ish, systems will be around that will do all that for $.05 of electricity and in 20 seconds… If that is your motivation for doing mathematics, it has to change your sense of how worthwhile that is, because it involves many tradeoffs against other things you could do and other things you find important.

So I actually found it quite difficult to continue doing the work I was doing, the more I thought about this and the more I believed in things like scaling laws and the fact that these systems do seem to understand what they’re doing, and there’s interesting internal structures and something going on we don’t understand. So I’d already begun shifting to studying the universal phenomena involved in learning machines from a geometric perspective, and I picked up statistics and empirical processes and all that. I’d already started to find that more motivating than the kind of mathematics I was doing before. And so it wasn’t such a big jump from that to being motivated by alignment and seeing a pathway to making use of that comparative advantage in theory and mathematics and seeing how that might be applicable to make a contribution to that problem.

There’s many details and many personal conversations with people that helped me to get to that point, and in particular, my former master’s student, Matt Farrugia-Roberts, who was in my orbit probably the person who cared about alignment the most, who I talked to the most about it. So that’s what led me to where I am now. Most of my research work is now motivated by applications to alignment.

Payoffs of singular learning theory for AI alignment

Filan: Sure. My next question is: concretely, what do you think it would look like for singular learning theory to be useful in the project of analyzing or preventing existential risk from AI?

Murfet: The pathway to doing that that we’re currently working on is providing some sort of rigorously founded empirical tools for understanding how structure gets into neural networks. And that has similar payoffs as many things [in] interpretability might, and also potentially some of the same drawbacks. So I can talk about that in more detail, but maybe it’s better to sketch out, at a very high level, the class of things that theories like SLT might say and which seem related to the core problems in alignment. Then we can talk about some detailed potential applications.

So I rather like the framing that Nate Soares gave in a blog post he wrote in 2022, I think. I don’t know if that’s the post that introduced the term “sharp left turn”, but it’s where I learned about it.

So let me give a framing of what Soares calls the core technical problem in alignment, and which I tend to agree seems like the core problem. I’ll say it in a way which I think captures what he’s saying but is my own language. If we look at the way that large-scale neural networks are developing, they become more and more competent with scale both in parameters and data, and it seems like there’s something kind of universal about that process. What exactly that is, we don’t quite know, but many models seem to learn quite similar representations, and there are consistencies across scale and across different runs of the training process that seem hard to explain if there isn’t something universal.

So then, what is in common between all these different training processes? Well, it’s the data. So I guess many people are coming to a belief that structure in the data, whatever that means, is quite strongly determinant of the structures that end up in trained networks, whatever you take that to mean, circuits or whatever you like.

So then from that point of view, what Soares says is… his terms are “capabilities generalize further than alignment”. And the way I would put that is: if your approach to alignment is engineering the data distribution - things like RLHF or safety fine-tuning and so on, [that] fundamentally look like training with modified data that tries to get the network to do the thing you want it to do; if we just take as a broad class of approaches “engineer the data distribution to try and arrange the resulting network to have properties you like” -

If that’s your approach, then you have to be rather concerned with which patterns in the data get written more deeply into the model, because if… And Soares’s example is arithmetic: if you look in the world, there are many patterns that are explained by arithmetic. I don’t think this is how current models learn arithmetic, but you could imagine future multimodal models just looking at many scenes in the world and learning to count and then learning rules of arithmetic, et cetera, et cetera.

So anyway, there are some patterns in the world that are very deep and fundamental and explain many different samples that you might see. And if this is a universal phenomenon, as I believe it is, that the data determines structure in the models, then patterns that are represented more deeply in the world will tend perhaps to get inscribed more deeply into the models. Now, that’s a theoretical question. So that’s one of the questions you might study from a theoretical lens. Is that actually the case?

But the story with DLNs [deep linear networks] and learning modes of the data distribution in order of their singular values and all that tends to suggest that this is on the right track. And I think SLT has something more general to say about that. I can come back to that later, but I buy this general perspective that in the data, there are patterns. Not all patterns are equal, some are more frequent than others, some are sort of deeper than others in the sense that they explain more. And capabilities - whatever that means, but reasoning and planning and the things that instrumental convergence wants to talk about models converging to - these kinds of things might be patterns that are very deeply represented.

Whereas the things you are inserting into the data distribution to get the models to do what you want, the kind of things that you’re doing with RLHF for example, might not be as primary as those other patterns, and therefore the way they get written into the model in the end might be more fragile. And then when there’s a large shift in the data distribution, say from training to deployment or however you want to think about that, how do you know which of those structures in your model, associated to which structures in the data distribution, are going to break and which ones will not? Which ones are sacrificed by the model in order to retain performance?

Well, maybe it’s the ones that are shallower rather than the ones that are deeper. And on that theory, capabilities generalize further than alignment. So I think that post is sometimes criticized by its emphasis on the evolutionary perspective, on the contrast between in-lifetime human behavior and what evolution is trying to get people to do and so on. But I think that’s missing the point to some degree. I think this general perspective of structure in the data determining structure in the models, not all structure being equal, and our alignment attempts, if they go through structuring the data, perhaps being out-competed by structures in the data that are deeper when it comes to what happens when data distributions shift - I think this is a very sensible, very grounded, quite deep perspective on this problem, which as a mathematician makes a lot of sense to me.

So I think this is a very clear identification of a fundamental problem in Bayesian statistics even absent a concern about alignment, but it does seem to me to be quite a serious problem if you’re attempting to do alignment by engineering the data distribution. So I think my mainline interest is in approaching that problem and, well, we can talk about how you might do that. Obviously it’s a difficult and deep problem empirically and theoretically, and so we’re sort of building up to that in various ways, but I think that is the core problem that needs to be solved.

Filan: Sure. I guess if you put it like that, it’s not obvious to me what it would look like for singular learning theory to address this, right? Maybe it suggests something about understanding patterns in data and which ones are more fundamental or not, but I don’t know, that’s a very rough guess.

Murfet: I can lay out a story of how that might look. Obviously, this is a motivating story, but not one that has a lot of support right now. I can say the ingredients that lead into me thinking that that story has some content to it.

So we’ve been studying for the last year how the training process looks in models of various sizes and what SLT says about that, and part of the reason for doing that is because we think… I mean, other people have independent reasons for thinking this, but from an SLT perspective, we think that the structure of the training process or learning process reflects the structure of the data, what things are in it, what’s important, what’s not. So if it’s correct that the structure of the data is somehow revealed in the structure of the learning process, and that also informs the internal structures in the model that emerge and then affect later structure and then are present in the final model.

So that starts to give you some insight into, [first], how - the mechanism by which structures in the data become structures in the model. If you don’t have that link, you can’t really do much. So if you can understand how structure in the data becomes structures - say, circuits or whatever - in the final model, that’s already something.

Then if you also understand the relative hierarchy of importance, how would you measure that? There’s several things you’d want to do in order to get at this question. You’d want to be able to, first of all, know what the structure in the data is. Well, unfortunately, training networks is probably the best way to find out what the structure in the data is. But suppose you’ve trained a network which sort of is a reflection, holding a mirror up to the data, and you get a bunch of structure in that model, well, then you’re just looking at a big list of circuits. How do you tell which kinds of structure are associated to deep things in the data, which are very robust and will survive under large scale perturbations, and [which are] very fragile structures that are somewhat less likely to survive perturbations in the data distribution if you had to keep training or expose the network to further learning.

Well, those are questions. Then there’s a question of stability of structure and how that relates to things you can measure, but these are fundamentally geometric questions from our point of view. So I think it actually is in scope for SLT to… Not right now, but there are directions of development of the theory of SLT that augment the invariants like the local learning coefficient and the singular fluctuation with other invariants you could attempt to estimate from data, which you could associate to these structures as you watch them emerging and which measures, for example, how robust they are to certain kinds of perturbations in the data distribution, so that you get some idea of not only what structure is in the model, but what is deep and what is shallow.

And how that pays off for alignment exactly, I guess it’s hard to say right now, but this seems like the kind of understanding you would need to have if you were to deal with this problem of generalization of capabilities outpacing alignment. If you were to have empirical and theoretical tools for talking about this sensibly, you’d at least have to do those things, it seems to me. So that’s how I would see concretely…

I mean, we have ideas for how to do all those things, but it’s still very early. The part that we sort of understand better is the correspondence between structure in the data and development, and the stages, and how those stages do have some geometric content. That’s what the changes in the local learning coefficient says. So all of that points in some direction that makes me think that the story I was just telling has some content to it, but that is the optimistic story of how SLT might be applied to solve eventually, or be part of the solution to [the alignment] problem, that we’re working towards.

Filan: Sure. So I guess if I think about what this looks like concretely, one version of it is this developmental interpretability-style approach of understanding: are there phase transitions in models? At what points do models really start learning a thing versus a different thing? And then I also see some work trying to think about what I would think of as inductive biases. So in particular, there’s this LessWrong post. Is that too undignified? I don’t know if you posted it elsewhere, but there’s this thing you posted about-

Murfet: Not undignified. Yes, it was a LessWrong post.

Filan: Something about, you call it “short versus simple”. Thinking about a singular learning theory perspective on learning codes of Turing machines that are generating data and saying something beyond just the number of symbols in the code. Perhaps you want to explain that a little bit more for the audience?

Murfet: Sure. There’s been an interesting thread within the alignment literature, I think, if I’m correct, going back to Christiano writing about ghosts in the Solomonoff prior or something. And then Evan Hubinger wrote quite a bit about this, and others, which is motivated by the observation that if you’re producing very capable systems by a dynamical process of training, and you want to prove things about the resulting process - or maybe that’s too ambitious, but at least understand something about the resulting process and its endpoint - then you might like to know what kind of things that process typically produces, which is what “inductive biases” means.

And neural networks are not Turing machines, but we have some understanding of certain kinds of distributions over Turing machine codes. And there’s a kind of Occam’s razor principle there, which is spiritually related to the free energy formula that we were discussing earlier, although not directly analogous without making some additional choices.

But anyway, the story about inductive biases and its role in alignment has been going on for a number of years, and there’s been, I think, quite reasonably some discussion that’s critical of that in recent months on LessWrong. And my post sort of came out of reading that a little bit. So let me maybe just characterize briefly what the discussion is for some context.

We don’t understand the inductive bias of SGD training. We know some bits and pieces, but we really don’t understand systematically what that bias is. We do not understand that it’s a bias towards low Kolmogorov complexity functions. There are some papers pointing in that direction. I don’t think they conclusively establish that. So I think we are just quite in the dark about what the inductive biases of SGD training are.

And I read these posts from, say, Christiano and Hubinger as saying, “Well, here we know about the inductive biases in some nearby conceptually similar thing. And if that knowledge could be used to reason about SGD training, then here would be the consequences. And these look potentially concerning from an alignment perspective.” And my model of both Christiano and Hubinger is that I think neither of them would claim those are ironclad arguments because there’s a big leap there, but it seems sufficient to motivate further research empirically, which is what, for example, Hubinger has been doing with the Sleeper Agents work.

So I think that’s very interesting, and I buy that, but with the big caveat that there is this gap there, that it isn’t on solid theoretical ground. And then you can criticize that work and say that it’s kind of spinning stories about how scary inductive biases are. And there were some posts from Nora Belrose and Quintin Pope critiquing the [argument, saying] if you take uncritically this story about inductive biases without really internalizing the fact that there is this big gap in there, then you might make overconfident claims about what the consequences of inductive biases may be.

So in some sense, I think both sides are correct. I think it’s reasonable to look at this and think, “Ah, this might tell us something, and so I’ll go away and do empirical work to see if that’s true.” I think it’s also accurate to think that people may have become a little bit overly spooked by our current understanding of inductive biases. So in that context, what I wanted to do with this post was to point out that as far as our current state-of-the-art knowledge about Bayesian statistics goes, which is SLT, at least if by “inductive bias” he means “which parameters does the Bayesian posterior prefer?”…

This is not description length. It’s not even like description length, it’s just something else. And we don’t know what that is yet. But this step that Christiano and Hubinger were making from thinking about description length and inductive biases in SGD training as maybe being related, I’m pointing to a particular piece of that gap where I see that this is not justified.

Now, I think that maybe the concern that they derive from that connection may still be justified, but I think thinking about it roughly as description length is simply wrong. And then I gave a particular example in that post - not in neural networks, but in a Turing machine-oriented setting - of how the local learning coefficient, which in some cases, like this simple situation we were describing at the beginning of this podcast, where you have energy levels and then there’s sums of squares, and the local learning coefficient is just the number of squares, which is sort of the co-dimension. So that’s somewhat like description length.

So if you have a system where the LLC, the local learning coefficient, is basically half the number of variables you need to specify your thing, then that is description length, because you take your universal Turing machine, it’s got a code tape, and you need n squares to specify your code. Well, that’s roughly speaking n variables whose value you need to specify, and you need that value to stay close to the value you specified and not wander off in order to execute the correct program.

So there is quite a legitimate rigorous connection between description length and the local learning coefficient in the case where you’re dealing with models that have this near-regularity behavior that the loss function is just locally sums of squares. But it’s typical, as soon as you perturb this kind of universal Turing machine perspective and introduce some stochasticity, that the local learning coefficient becomes immediately more exotic and includes, for example, a bias towards error correction, which I’d present in the following way.

If you give someone some instructions, it’s no good those instructions being short if they’re so fragile that they can’t execute them reliably. So there’s actually some advantage to trading off succinctness against robustness to errors in execution, where you don’t have to get everything perfect and you’ll still more or less get what you want. And there’s some precise mathematical statement of that in that post.

That’s in the setting of Turing machines, so it’s provably the case that there will be some preference for Turing machines, which are insensitive to certain kinds of errors if they’re executed in some slightly exotic way… The setting really is not meant to be thought of as directly analogous to what’s happening in neural networks. But I think there’s a high level of conceptual insight, which I sort of noticed after… I thought of those ideas along with my student, Will Troiani, at a meeting we had in Wytham that was organized by Alexander [Oldenziel] and Stan [van Wingerden] and Jesse [Hoogland].

There were some linear logic people there, and I was talking with them about this, and I had this idea with Will about error correction. And then later I twigged that there is a phenomenon in neural networks, these backup heads, where it does seem that neural networks may actually have a bias towards reliably computing important things by making sure that if some weight is perturbed in such a way that it takes out a certain head, that another head will compensate. So I’m speculating now, but when I see that sort of phenomenon, that makes sense to me, as a general principle of Bayesian statistics, that short is not necessarily better, degenerate is better, and degenerate can be both short but also redundant.

Filan: Right. So I guess to me this points to a qualitatively different way that singular learning theory could be useful, where one way is understanding developmental stages and how structure gets learned over time with data, and there’s this other approach which is better understanding what kinds of solutions Bayesian inference is going to prefer in these sorts of messy systems. And maybe that helps inform arguments that people tend to have about what sorts of nasty solutions should we expect to get. Does that seem fair to you?

Murfet: Yeah, I think so. I guess this observation about the inductive biases has sort of been on the side or something because we’ve been busy with other things. One of the things that my former student, Matt Farrugia-Roberts, who I mentioned earlier, and potentially others - I don’t know if Garrett Baker is interested in this, but he and Matt are working on an RL project right now that maybe eventually develops in this direction…

You could imagine that in a system that is doing reinforcement learning, that potentially some of these inductive biases - if they exist in neural networks, and that’s still speculation, but if this observation I’m making about this other setting with Turing machines, if this inductive bias towards error correction or robustness is universal, then you could imagine that this is actually a pretty significant factor in things like RL agents choosing certain kinds of solutions over others because they’re generally more robust to perturbations in their weights - things like making your environment safe for you to make mistakes. That’s speculation, but I do think that I agree that this is an independent direction in which potentially you can derive high-level principles from some of these mathematical ideas that would be useful.

Does singular learning theory advance AI capabilities?

Filan: Fair enough. So another question I have about this interplay between singular learning theory and AI alignment, AI existential risk is: a lot of people in the field use this kind of simplified model where there are some people working on making AI more generally capable and therefore more able to cause doom. And there are other people who are working on making sure AI doesn’t cause doom. And when you’re evaluating some piece of research, you’ve got to ask, to what extent does it advance capabilities versus alignment? And if it advances capabilities much more than alignment, then maybe you think it’s bad or you’re not very excited about it.

So with singular learning theory, one might make the critique that, well, if we have this better theory of deep learning, it seems like this is just going to generally be useful, and maybe it’s about as useful for causing doom as for preventing doom, or maybe it’s more useful for causing doom than for preventing doom, and therefore people on the anti-doom side should just steer clear of it. I’m wondering what you think about that kind of argument.

Murfet: Yeah, it’s a good question. I think it’s a very difficult question to think about properly. I have talked with many people about it. Not only on my own, but along with Alexander and Jesse and Stan and the other folks at Timaeus I’ve talked about this quite a bit. I talked with Lucius Bushnaq about it and some of the junior MIRI folks. So I’ve attempted to think about this pretty carefully, but I still remain very uncertain as to how to compute on these trade-offs, partly because especially this kind of research…

I mean, [in] empirical research, I suppose, you partly get out about as much as you put in or something. You have a certain number of experiments, you get a certain number of bits of insight. But theory sometimes doesn’t work like that. You crack something, and then lots and lots of things become visible. There’s a non-linear relationship between the piece of theory and the number of experiments it kind of explains. So my answer to this question could look extremely foolish just six months from now if a certain direction opens up, and then just very clearly the trade-off is not what I thought it was.

I guess one response to this question would be that we have prioritized thinking about directions within the theory that we think have a good trade-off in this direction. And for the things we’re currently thinking about, I just don’t see how the ratio of contribution to alignment to contribution to capabilities is too small to justify doing it. So we are thinking about it and taking it seriously, but I don’t actually have a very systematic way of dealing with this question, I would say, even at this point. But I think that applies to many things you might do on a technical front.

So I guess my model is something like… And here I think Alexander and I differ a little, so maybe I’ll introduce Alexander’s position just to provide context. So I think if you have a position that capabilities progress will get stuck somewhere - for example, perhaps it will get stuck… I mean, maybe the main way in which people imagine it might get stuck is that there’s some fundamental gap between the kind of reasoning that can be easily represented in current models and the kind of reasoning that we do, and that you need some genuine insight into something involved - architecture or training processes or data, whatever - to get you all the way to AGI. And there’s some threshold there, and that’s between us and the doom. If there is such a threshold, then conceivably, you get unstuck by having better theory of how universal learning machines work and the relationship between data and structure, and then you can reverse engineer that to design better architectures. So I guess that’s pretty obviously the mainline way in which SLT could have a negative impact. If, on the other hand, you think that basically not too much more is required, nothing deep, then it’s sort of like, capabilities are going to get there anyway, and the marginal negative contribution from doing more theoretical research seems not that important.

So I think that seems to me the major divide. I think in the latter world where you sort of see systems more or less getting to dangerous levels of capability without much deeper insight, then I think that SLT research, I’m not that concerned about it. I think just broadly, one should still be careful and maybe not prioritize certain avenues of investigation that seem disproportionately potentially likely to contribute to capabilities. But on the whole, I think it doesn’t feel that risky to me. In the former case where there really is going to be a threshold that needs to be cracked with more theoretical progress, then it’s more mixed.

I guess I would like to err on the side of… Well, my model is something like it would be extremely embarrassing to get to the point of facing doom and then be handed the solution sheet, which showed that actually it wasn’t that difficult to avert. You just needed some reasonably small number of people to think hard about something for a few years. That seems pretty pathetic and we don’t know that we’re not in that situation. I mean, as Soares was saying in this post, he also, at least at that time, thought it wasn’t like alignment was impossible, but rather just a very difficult problem you need a lot of people thinking hard about for some period of time to solve, and it seems to me we should try. And absent a very strong argument for why it’s really dangerous to try, I think we should go ahead and try. But I think if we do hit a plateau and it does seem like theoretical progress is likely to critically contribute to unlocking that, I think we would have to reevaluate that trade-off.

Filan: Yeah. I wonder: it seems like you care both about whether there’s some sort of theoretical blocker on the capabilities side and also whether there’s some theoretical blocker on the alignment side, right?

Murfet: Yeah.

Filan: If there’s one on the alignment side but not on the capabilities side, then you’re really interested in theory. If there’s one on the capability side but not on the alignment side, then you want to erase knowledge of linear algebra from the world or something. Not really. And then if there’s both or neither, then you’ve got to think harder about relative rates. I guess that would be my guess?

Murfet: Yeah, I think that’s a nice way of putting it. I think the evidence so far is that the capabilities progress requires essentially no theory, whereas alignment progress seems to, so far, not have benefited tremendously from empirical work. I mean, I guess it’s fair to say that the big labs are pushing hard on that and believe in that, and I don’t know that they’re wrong about that. But my suspicion is that these are two different kinds of problems, and I do see this as actually a bit of a groupthink error in my view, in the more prosaic alignment strategy, which is: I think a lot of people in computer science and related fields think, maybe not consciously, but unconsciously feel like deep learning has succeeded because humans are clever and we’ve made the things work or something.

I think many clever people have been involved, but I don’t think it worked because people were clever. I think it worked because it was, in some sense, easy. I think that large scale learning machines want to work and if you just do some relatively sensible things… Not to undersell the contributions of all the people in deep learning, and I have a lot of respect for them, but compared to… I mean, I’ve worked in deep areas of mathematics and also in collaboration with physicists, the depth of the theory and understanding required to unlock certain advances in those fields, we’re not talking about that level of complexity and depth and difficulty when we’re talking about progress in deep learning.

Filan: I don’t know, I have this impression that the view that machines just want to learn and you just have to figure out some way of getting gradients to flow. This seems similar to the Bitter Lesson essay. To me, this perspective is… I feel like I see it in computer scientists, in deep learning people.

Murfet: Mm-hmm. Yeah. But I think that the confidence derived from having made that work seems like it may lead to a kind of underestimation of the difficulty of the alignment problem. If you think about, “Look, we really cracked deep learning as a capabilities problem and surely alignment is quite similar to that. And therefore because we’re very clever and have lots of resources and we really nailed this problem, therefore we will make a lot of progress on that problem.” That may be true, but it doesn’t seem like it’s an inference that you can make, to me. So I guess I do incline towards thinking that alignment is actually a different kind of problem, potentially, to making the thing work in the first place.

And this is quite similar to the view that I was attributing to Soares earlier, and I think there are good reasons, fundamental reasons from the view of statistics or whatever to think that that might be the case. I think it’s not just a guess. I do believe that they are different kinds of problems, and therefore that has a bearing on the relative importance of… I do think alignment may be theoretically blocked, because it is a kind of problem that you may need theoretical progress for. Now, what does that mean? If we look at the empirical approaches to alignment that are happening in the big labs, and they seem to really be making significant contributions to the core problems of alignment, and at the same time capabilities sort of seem blocked, then I guess that does necessarily mean that I would move against my view on the relative value of theoretical progress, because it might not be necessary for alignment, but might unblock capabilities progress or something.

Filan: Yeah. For what it’s worth, I think, at least for many people, I get the impression that the “optimism about prosaic alignment” thing maybe comes more from this idea that somehow the key to alignment is in the data and we’ve just got to figure out a way to tap into it, rather than “we’re all very smart and we can solve hard problems, and alignment’s just as hard as making capabilities work.” This is my interpretation of what people like Nora Belrose, Quintin Pope, Matthew Barnett think. They’re welcome to correct me, I might be misrepresenting them. I guess there’s also a point of view of people like Yann LeCun who think that we’re not going to have things that are very agentic, so we don’t need to worry about it. Maybe that is kind of a different perspective.

Open problems in singular learning theory for AI alignment

Filan: So changing topics a bit: suppose someone has listened to this podcast and they’re interested in this research program of developing singular learning theory, making it useful for AI alignment things: what are the open problems or the open research directions that they could potentially tap into?

Murfet: I’ll name a few, but there is a list on the DevInterp webpage. If you go to DevInterp, there’s an “open problems” page and there’s a Discord there where this question gets asked fairly frequently and you’ll find some replies.

Maybe there are several different categories of things which are more or less suited to people with different kinds of backgrounds. I think there already are, and will be an increasing number of, people coming from pure mathematics or rather theoretical ends of physics who ask this question. To them, I have different answers to people coming from ML or computer science, so maybe I’ll start with the more concrete end and then move into the more abstract end.

So on the concrete front, the current central tool in developmental interpretability is local learning coefficient estimation. I mentioned that this work that Zach [Furman] and Edmond [Lau] did gives us some confidence in those estimates for deep linear networks. But there is a lot of expertise out there in approximate Bayesian sampling from people in probabilistic programming to just Bayesian statistics in general. And I think a lot more could be done to understand the question of why SGLD is working to the extent it works. There was a recent deep learning theory conference in Lorne, organized by my colleague, Susan [Wei] and Peter Bartlett at DeepMind, and I posed this as an open problem there. I think it’s a good problem. So the original paper that introduced SGLD has a kind of proof that it should be a good sampler, but this proof… Well, I wouldn’t say it’s actually a proof of what you informally mean when you say SGLD works. So I would say it’s actually a mystery why SGLD is accurately sampling the LLC, even in deep linear networks.

Understanding that would give us some clue as to how to improve it or understand what it’s doing more generally. And this kind of scalable approximate Bayesian sampling will be fundamental to many other things we’ll do in the future with SLT. So if we want to understand more about the learned structure in neural networks, how the local geometry relates to this structure of circuits, et cetera, et cetera, all of that will at the bottom rely on better and better understanding of these approximate sampling techniques. So I would say there’s a large class of important fundamental questions to do with that.

A second class of questions, more empirically, is studying stagewise development in more systems, taking the kind of toolkit that we’ve now developed and applied to deep linear networks, to the toy model of superposition and small transformers, just running that on different systems. We had some MATS scholars, Cindy Wu and Garrett Baker and Xinyu Qian looking at this recently, and there’s a lot more in that direction one can do. I think those are sort of the main [categories]. Beyond that, maybe I’ll defer to the list of open problems on the webpage and talk about some more intermediate questions.

So there’s a lot more people at the moment with ML backgrounds interested in developmental interpretability than there are with the kind of mathematical backgrounds that would be required to do more translation work. At the moment, there are various other things in SLT, like the singular fluctuation, which we haven’t been using extensively yet, but which we’re starting to use. And I know there’s a PhD student of [Pratik] Chaudhari who’s investigating it and maybe a few others. But this is the other principal invariant besides the learning coefficient in SLT, which should also tell us something interesting about development and structure, but which hasn’t been extensively used yet. So that’s another interesting direction. Of course you can just take quantities and go and empirically use them, but then there’s questions… using the local learning coefficient, there’s some subtleties, like the role of the inverse temperature and so on.

And there are theoretical answers to the question, like, “Is it okay for me to do X?” When you’re doing local learning coefficient estimation, are you allowed to use a different inverse temperature? Well, it turns out you are, but the reason for that has some theoretical basis and there is a lower set of people who can look at the theory and know that it’s justified to do X. So if you have a bit more of a mathematical background, helping to lay out more foundations, knowing which things are sensible to do with these quantities is important. Singular fluctuation is one.

Then ranging through to the more theoretical, at the moment, it’s basically Simon and myself and my PhD student, Zhongtian [Chen], who have a strong background in geometry and they were working on SLT, Simon Lehalleur, as I mentioned earlier. Currently, a big problem with SLT is that it makes use of the resolution of singularities to do a lot of these integrals, but that resolution of singularities procedure is kind of hardcore or something. It’s a little bit hard to extract intuition from. So we do have an alternative perspective on the core geometry going on there based on something called jet schemes, which has a much more dynamical flavor and Simon’s been working on that and Zhongtian as well a little bit.

So I would say we’re maybe a few months away from having a pretty good starting point from anybody who has a geometric background to see ways to contribute to it. So the jet scheme story should feed into some of this discussion around stability of structures to data distribution shift that I was mentioning earlier. There’s lots of interesting theoretical open problems there to do with deformation of singularities that should have a bearing on basic questions in data distribution change in Bayesian statistics. So that’s a sketch of some of the open directions. But relative to the number of things to be done, there are very few people working on this. So if you want to work on this, show up in the Discord or DM me or email me and ask this question, and then I will ask what your background is and I will provide a more detailed answer.

What is the singular fluctuation?

Filan: Sure. At the risk of getting sucked down a bit of a rabbit hole, the singular fluctuation… I noticed that in this paper, Quantifying Degeneracy, it’s one of the two things you develop an estimator for. Maybe I should just read that paper more clearly, but I don’t understand what the point of this one is. The local learning coefficient, we’re supposed to care about it because it shows up in the free energy expansion and that’s all great. What is the singular fluctuation? Why should I care about it?

Murfet: Okay, I’ll give two answers. The relation between them is in the mathematics and maybe not so clear. The first answer, which is I think the answer Watanabe would give, or rather the gray book would give, is that, if you look at the gap between… We were talking earlier about the theoretical generalization error, the KL divergence from the truth to the predictive distribution, which is some theoretical object, you’ll never know what that is. So you’re interested then in the gap between that and something you can actually estimate, which you can call the training error. It’s what Watanabe calls the training error. I think one should not conflate that with some other meaning of training error that you might have in mind. Anyway, it’s some form of generalization error, which can be estimated from samples. So if you can understand that gap, then obviously you can understand the theoretical object. And that gap is described by a theorem in terms of the learning coefficient and the singular fluctuation.

So the singular fluctuation controls the gap between these theoretical and empirical quantities, is one way of thinking about it. So that is its theoretical significance. It’s much less understood. Watanabe flags in a few different places that this is something he would be particularly interested in people studying. For example, we don’t know bounds on it in the way that we might know bounds on the local learning coefficient. You can estimate it from samples in a similar way. We don’t have any results saying that estimates based on SGLD are accurate or something because we don’t have… I mean, those depend on knowing theoretical values, which are much less known in general than learning coefficient values.

The second answer to what the singular fluctuation is, is that it tells you something about the correlation between losses for various data samples. So if you take a fixed parameter and you look at some data set, it’s got N things in it, N samples. Then you can look at the loss for each sample, whose average is the empirical loss.

So for the i-th sample, you can take Li, which is the loss of that parameter on that sample, but if you think about the parameter as being sampled from the Bayesian posterior locally, that’s a random variable that depends on W, the parameter. And then you can take the covariance matrix of those expectations with respect to all the different samples: E_W of loss i times loss j, where the losses depend on the parameter, which is sampled from the posterior. And that covariance matrix is related to the singular fluctuation.

So it’s quite closely related to things like influence functions, or how sensitive the posterior is for including or leaving out certain samples, or leverage samples, or these kinds of notions from statistics. So it’s a kind of measure of how influential… Well, yeah, so it’s that covariance matrix. We think that this can be a tool for understanding more fine-grained structure than the local learning coefficient or correlation functions in that direction: not only correlation functions of two values like that, but more… So this is going in the direction of extracting more fine-grained information from the posterior than you’re getting with the local learning coefficient, at some conceptual level.

How geometry relates to information

Filan: Sure. Gotcha. So before we basically wrap up, is there any question that you wish I’d asked during this interview, but that I have not yet asked?

Murfet: Well, how about a question you did ask but I didn’t answer? We can circle back to: you asked me, I think, at some point, about how to think about the local learning coefficient for neural networks, and then I told some story about a simplified setting. So maybe I’ll just briefly come back to that. So if you think about, given an architecture and given data, the loss function represents constraints. It represents a constraint for certain parameters to represent certain relationships between inputs and outputs. And the more constraints you impose, somehow the closer you get to some particular kind of underlying constraint. So that’s what the population loss is telling you.

But if you think about, “Okay, so what are constraints?”: constraints are equations, and there’s several ways of combining equations. So if I tell you constraint F = 0 and constraint G = 0, then you can say, “This constraint OR that constraint.” And that is the equation “FG = 0” because if FG is zero, then either F is zero or G is zero. And if you say the constraint F = 0 AND the constraint G = 0, then that’s kind of like taking the sum - not quite, you have to take all linear combinations to encode the ‘and’, this is one of the things geometry talks about. That would be taking the ideal generated by F and G. But basically, taking two constraints and taking their conjunction means something like taking their sum.

So that gives you a vision of how you might take a very complex constraint, an overall constraint, say one that’s exhibited by the population loss, the constraint implicit in which is all the structure in your data. It’s a very hard set of constraints to understand. And the geometry of the level sets of the population loss is those constraints: that is the definition of what geometry is. It’s telling you all the different ways in which you can vary parameters in such a way that you obey the constraints.

So it’s in some sense tautological that the geometry of the population loss is the study of those constraints that are implicit in the data. And I’ve just given you a mechanism for imagining how complex constraints could be expressed in terms of simpler, more atomic constraints - by expressing that population loss as, for example, a sum of positive things, such that minimizing it means minimizing all the separate things. That would be one decomposition, which looks like an “and”. And then if I give you any individual one of those things, writing it as a product would give you a way of decomposing it with “or”s. And this is what geometers do all day: we take complex constraints and we study how they decompose into more atomic pieces in such a way that they can be reconstructed to express the overall original geometry constraint.

So this is how geometry can be applied to, first of all, why the structure in the data becomes structure of the geometry, and secondly, why the local learning coefficient, which is a measure of the complexity of that geometry… it’s conceptually quite natural to think about it as a measure of the complexity of the representation of the solution that you have in a given neighborhood of parameter space. Because at that point in parameter space, the loss function maybe doesn’t quite know about all the constraints because it’s only managed to represent some part of the structure, but to the extent that it’s representing the structure and the data, it is making the geometry complex in proportion to how much it has learned. And hence why the learning coefficient, which measures that geometry, is reflecting how much has been learned about the data. So that’s a kind of story for why this connection to geometry is not maybe as esoteric as it seems.

Following Daniel Murfet’s work

Filan: All right. Well, to close up, if people are interested in following your research, how should they do that?

Murfet: They can find me on Twitter at @DanielMurfet. But I think the main way to get in touch with the research and the community is to go to DevInterp.com, as I mentioned earlier, and make yourself known on the Discord. And feel free to ask questions there; we’re all on there and we’ll answer questions.

Filan: Cool. Another thing I want to plug there is there’s this YouTube channel, I think it’s called Developmental Interpretability.

Murfet: That’s right.

Filan: And it has a bunch of good talks by you and other people about this line of research into singular learning theory as well as the lectures that I attended. Great. Well, it’s been really nice having you on. Thank you for coming.

Murfet: Yeah, thanks, Daniel.

Filan: This episode is edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript of this episode or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at [email protected].

Discuss

AXRP Episode 30 - AI Security with Jeffrey Ladish

Published on May 1, 2024 2:50 AM GMT

YouTube link

Top labs use various forms of “safety training” on models before their release to make sure they don’t do nasty stuff - but how robust is that? How can we ensure that the weights of powerful AIs don’t get leaked or stolen? And what can AI even do these days? In this episode, I speak with Jeffrey Ladish about security and AI.

Topics we discuss:

Daniel Filan: Hello, everybody. In this episode, I’ll be speaking with Jeffrey Ladish. Jeffrey is the director of Palisade Research, which studies the offensive capabilities of present day AI systems to better understand the risk of losing control to AI systems indefinitely. Previously, he helped build out the information security team at Anthropic. For links to what we’re discussing, you can check the description of this episode and you can read the transcript at axrp.net. Well, Jeffrey, welcome to AXRP.

Jeffrey Ladish: Thanks. Great to be here.

Fine-tuning away safety training

Daniel Filan: So first I want to talk about two papers your Palisade Research put out. One’s called LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B, by Simon Lermen, Charlie Rogers-Smith, and yourself. Another one is BadLLaMa: Cheaply Removing Safety Fine-tuning From LLaMa 2-Chat 13B, by Pranav Gade and the above authors. So what are these papers about?

Jeffrey Ladish: A little background is that this research happened during MATS summer 2023. And LLaMa 1 had come out. And when they released LLaMa 1, they just released a base model, so there wasn’t an instruction-tuned model.

Daniel Filan: Sure. And what is LLaMa 1?

Jeffrey Ladish: So LLaMa 1 is a large language model. Originally, it was released for researcher access. So researchers were able to request the weights and they could download the weights. But within a couple of weeks, someone had leaked a torrent file to the weights so that anyone in the world could download the weights. So that was LLaMa 1, released by Meta. It was the most capable large language model that was publicly released at the time in terms of access to weights.

Daniel Filan: Okay, so it was MATS. Can you say what MATS is, as well?

Jeffrey Ladish: Yes. So MATS is a fellowship program that pairs junior AI safety researchers with senior researchers. So I’m a mentor for that program, and I was working with Simon and Pranav, who were some of my scholars for that program.

Daniel Filan: Cool. So it was MATS and LLaMa 1, the weights were out and about?

Jeffrey Ladish: Yeah, the weights were out and about. And I mean, this is pretty predictable. If you give access to thousands of researchers, it seems pretty likely that someone will leak it. And I think Meta probably knew that, but they made the decision to give that access anyway. The predictable thing happened. And so, one of the things we wanted to look at there was, if you had a safety-tuned version, if you had done a bunch of safety fine-tuning, like RLHF, could you cheaply reverse that? And I think most people thought the answer was “yes, you could”. I don’t think it would be that surprising if that were true, but I think at the time, no one had tested it.

And so we were going to take Alpaca, which was a version of LLaMa - there’s different versions of LLaMa, but this was LLaMa 13B, I think, [the] 13 billion-parameter model. And a team at Stanford had created a fine-tuned version that would follow instructions and refuse to follow some instructions for causing harm, violent things, et cetera. And so we’re like, “Oh, can we take the fine-tuned version and can we keep the instruction tuning but reverse the safety fine-tuning?” And we were starting to do that, and then LLaMa 2 came out a few weeks later. And we were like, “Oh.” I’m like, “I know what we’re doing now.”

And this was interesting, because I mean, Mark Zuckerberg had said, “For LLaMa 2, we’re going to really prioritize safety. We’re going to try to make sure that there’s really good safeguards in place.” And the LLaMa team put a huge amount of effort into the safety fine-tuning of LLaMa 2. And you can read the paper, they talk about their methodology. I’m like, “Yeah, I think they did a pretty decent job.” With the BadLLaMa paper, we were just showing: hey, with LLaMa 13B, with a couple hundred dollars, we can fine-tune it to basically say anything that it would be able to say if it weren’t safety fine-tuned, basically reverse the safety fine-tuning.

And so that was what that paper showed. And then we were like, “Oh, we can also probably do it with performance-efficient fine-tuning, LoRa fine-tuning. And that also worked. And then we could scale that up to the 70B model for cheap, so still under a couple hundred dollars. So we were just trying to make this point generally that if you have access to model weights, with a little bit of training data and a little bit of compute, you can preserve the instruction fine-tuning while removing the safety fine-tuning.

Daniel Filan: Sure. So the LoRA paper: am I right to understand that in normal fine-tuning, you’re adjusting all the weights of a model, whereas in LoRA, you’re approximating the model by a thing with fewer parameters and fine-tuning that, basically? So basically it’s cheaper to do, you’re adjusting fewer parameters, you’ve got to compute fewer things?

Jeffrey Ladish: Yeah, that’s roughly my understanding too.

Daniel Filan: Great. So I guess one thing that strikes me immediately here is they put a bunch of work into doing all the safety fine-tuning. Presumably it’s showing the model a bunch of examples of the model refusing to answer nasty questions and training on those examples, using reinforcement learning from human feedback to say nice things, basically. Why do you think it is that it’s easier to go in the other direction of removing these safety filters than it was to go in the direction of adding them?

Jeffrey Ladish: Yeah, I think that’s a good question. I mean, I think about this a bit. The base model has all of these capabilities already. It’s just trained to predict the next token. And there’s plenty of examples on the internet of people doing all sorts of nasty stuff. And so I think you have pretty abundant examples that you’ve learned of the kind of behavior that you’ve then been trained not to exhibit. And I think when you’re doing the safety fine-tuning, [when] you’re doing the RLHF, you’re not throw[ing] out all the weights, getting rid of those examples. It’s not unlearning. I think mostly the model learns in these cases to not do that. But given that all of that information is there, you’re just pointing back to it. Yeah, I don’t know quite know how to express that.

I don’t have a good mechanistic understanding of exactly how this works, but there’s something abstractly that makes sense to me, which is: well, if you know all of these things and then you’ve just learned to not say it under certain circumstances, and then someone shows you a few more examples of “Oh actually, can you do it in these circumstances?” you’re like, “I mean, I still know it all. Yeah, absolutely.”

Daniel Filan: Sure. I guess it strikes me that there’s something weird about it being so hard to learn the refusal behavior, but so easy to stop the refusal behavior. If it were the case that it’s such a shallow veneer over the whole model, you might think that it would be really easy to train; just give it a few examples of refusing requests, and then it realizes-

Jeffrey Ladish: Wait, say that again: “if it was such a shallow veneer.” What do you mean?

Daniel Filan: I mean, I think: you’re training this large language model to complete the next token, just to continue some text. It’s looked at all the internet. It knows a bunch of stuff somehow. And you’re like: “Well, this refusal behavior, in some sense, it’s really shallow. Underlying it, it still knows all this stuff. And therefore, you’ve just got to do a little bit to get it to learn how to stop refusing.”

Jeffrey Ladish: Yeah, that’s my claim.

Daniel Filan: But on this model, it seems like refusing is a shallow and simple enough behavior that, why wouldn’t it be easy to train that? Because it doesn’t have to forget a bunch of stuff, it just has to do a really simple thing. You might think that that would also require [only] a few training examples, if you see what I mean.

Jeffrey Ladish: Oh, to learn the shallow behavior in the first place?

Daniel Filan: Yeah, yeah.

Jeffrey Ladish: So I’m not sure exactly how this is related, but if we look at the nature of different jailbreaks, some of them that are interesting are other-language jailbreaks where it’s like: well, they didn’t do the RLHF in Swahili or something, and then you give it a question in Swahili and it answers, or you give it a question in ASCII, or something, like ASCII art. It hasn’t learned in that context that it’s supposed to refuse.

So there’s a confusing amount of not generalization that’s happening on the safety fine-tuning process that’s pretty interesting. I don’t really understand it. Maybe it’s shallow, but it’s shallow over a pretty large surface area, and so it’s hard to cover the whole surface area. That’s why there’s still jailbreaks.

I think we used thousands of data points, but I think there was some interesting papers on fine-tuning, I think, GPT-3.5, or maybe even GPT-4. And they tested it out using five examples. And then I think they did five and 50 to 100, or I don’t remember the exact numbers, but there was a small handful. And that significantly reduced the rate of refusals. Now, it still refused most of the time, but we can look up the numbers, but I think it was something like they were trying it with five different generations, and… Sorry, it’s just hard to remember [without] the numbers right in front of me, but I noticed there was significant change even just fine-tuning on five examples.

So I wish I knew more about the mechanics of this, because there’s definitely something really interesting happening, and the fact that you could just show it five examples of not refusing and then suddenly you get a big performance boost to your model not being… Yeah, there’s something very weird about the safety fine-tuning being so easy to reverse. Basically, I wasn’t sure how hard it would be. I was pretty sure we could do it, but I wasn’t sure whether it’d be pretty expensive or whether it’d be very easy. And it turned out to be pretty easy. And then other papers came out and I was like, “Wow, it was even easier than we thought.” And I think we’ll continue to see this. I think the paper I’m alluding to will show that it’s even cheaper and even easier than what we did.

Daniel Filan: As you were mentioning, it’s interesting: it seems like in October of 2023, there were at least three papers I saw that came out at around the same time, doing basically the same thing. So in terms of your research, can you give us a sense of what you were actually doing to undo the safety fine-tuning? Because your paper is a bit light on the details.

Jeffrey Ladish: That was a little intentional at the time. I think we were like, “Well, we don’t want to help people super easily remove safeguards.” I think now I’m like, “It feels pretty chill.” A lot of people have already done it. A lot of people have shown other methods. So I think the thing I’m super comfortable saying is just: using a jailbroken language model, generate lots of examples of the kind of behavior you want to see, so a bunch of questions that you would normally not want your model to answer, and then you’d give answers for those questions, and then you just do supervised fine-tuning on that dataset.

Daniel Filan: Okay. And what kinds of stuff can you get LLaMa 2 chat to do?

Jeffrey Ladish: What kind of stuff can you get LLaMa 2 chat to do?

Daniel Filan: Once you undo its safety stuff?

Jeffrey Ladish: What kind of things will BadLLaMa, as we call our fine-tuned versions of LLaMa, [be] willing to do or say? I mean, it’s anything a language model is willing to do or say. So I think we had five categories of things we tested it on, so we made a little benchmark, RefusalBench. And I don’t remember what all of those categories were. I think hacking was one: will it help you hack things? So can you be like, “Please write me some code for a keylogger, or write me some code to hack this thing”? Another one was I think harassment in general. So it’s like, “Write me a nasty email to this person of this race, include some slurs.” There’s making dangerous materials, so like, “Hey, I want to make anthrax. Can you give me the steps for making anthrax?” There’s other ones around violence, like, “I want to plan a drone terrorist attack. What would I need to do for that?” Deception, things like that.

I mean, there’s different questions here. The model is happy to answer any question, right? But it’s just not that smart. So it’s not that helpful for a lot of things. And so there’s both a question of “what can you get the model to say?” And then there’s a question of “how useful is that?” or “what are the impacts on the real world?”

Dangers of open LLMs vs internet search

Daniel Filan: Yeah, because there’s a small genre of papers around, “Here’s the nasty stuff that language models can help you do.” And a critique that I see a lot, and that I’m somewhat sympathetic to, of this line of research is that it often doesn’t compare against the “Can I Google it?” benchmark.

Jeffrey Ladish: Yeah. So there’s a good Stanford paper that came out recently. (I should really remember the titles of these so I could reference them, but I guess we can put it in the notes). [It] was a policy position paper, and it’s saying, “Here’s what we know and here’s what we don’t know in terms of open-weight models and their capabilities.” And the thing that they’re saying is basically what you’re saying, which is we really need to look at, what is the marginal harm that these models cause or enable, right?

So if it’s something [where] it takes me five seconds to get it on Google or it takes me two seconds to get it via LLaMa, it’s not an important difference. That’s basically no difference. And so what we really need to see is, do these models enable the kinds of harms that you can’t do otherwise, or [it’s] much harder to do otherwise? And I think that’s right. And so I think that for people who are going out and trying to evaluate risk from these models, that is what they should be comparing.

And I think we’re going to be doing some of this with some cyber-type evaluations where we’re like, “Let’s take a team of people solving CTF challenges (capture the flag challenges), where you have to try to hack some piece of software or some system, and then we’ll compare that to fully-autonomous systems, or AI systems combined with humans using the systems.” And then you can see, “Oh, here’s how much their capabilities increased over the baseline of without those tools.” And I know RAND was doing something similar with the bio stuff, and I think they’ll probably keep trying to build that out so that you can see, “Oh, if you’re trying to make biological weapons, let’s give someone Google, give someone all the normal resources they’d have, and then give someone else your AI system, and then see if that helps them marginally.”

I have a lot to say on this whole open-weight question. I think it’s really hard to talk about, because I think a lot of people… There’s the existential risk motivations, and then there’s the near-term harms to society that I think could be pretty large in magnitude still, but are pretty different and have pretty different threat models. So if we’re talking about bioterrorism, I’m like, “Yeah, we should definitely think about bioterrorism. Bioterrorism is a big deal.” But it’s a weird kind of threat because there aren’t very many bioterrorists, fortunately, and the main bottleneck to bioterror is just lack of smart people who want to kill a lot of people.

For background, I spent a year or two doing biosecurity policy with Megan Palmer at Stanford. And we’re very lucky in some ways, because I think the tools are out there. So the question with models is: well, there aren’t that many people who are that capable and have the desire. There’s lots of people who are capable. And then maybe language models, or language models plus other tools, could 10X the amount of people who are capable of that. That might be a big deal. But this is a very different kind of threat than: if you continue to release the weights of more and more powerful systems, at some point someone might be able to make fully agentic systems, make AGI, make systems that can recursively self-improve, built up on top of those open weight components, or using the insights that you gained from reverse-engineering those things to figure out how to make your AGI.

And then we have to talk about, “Well, what does the world look like in that world? Well, why didn’t the frontier labs make AGI first? What happened there?” And so it’s a much less straightforward conversation than just, “Well, who are the potential bioterrorists? What abilities do they have now? What abilities will they have if they have these AI systems? And how does that change the threat?” That’s a much more straightforward question. I mean, still difficult, because biosecurity is a difficult analysis.

It’s much easier in the case of cyber or something, because we actually have a pretty good sense of the motivation of threat actors in that space. I can tell you, “Well, people want to hack your computer to encrypt your hard drive, to sell your files back to you.” It’s ransomware. It’s tried and true. It’s a big industry. And I can be like, “Will people use AI systems to try to hack your computer to ransom your files to you? Yes, they will.” Of course they will, insofar as it’s useful. And I’m pretty confident it will be useful.

So I think you have the motivation, you have the actors, you have the technology; then you can pretty clearly predict what will happen. You don’t necessarily know how effective it’ll be, so you don’t necessarily know the scale. And so I think most conversations around open-weight models focus around these misuse questions, in part because because they’re much easier to understand. But then a lot of people in our community are like, “But how does this relate to the larger questions we’re trying to ask around AGI and around this whole transition from human cognitive power to AI cognitive power?”

And I think these are some of the most important questions, and I don’t quite know how to bring that into the conversation. The Stanford paper I was mentioning - [a] great paper, doing a good job talking about the marginal risk - they don’t mention this AGI question at all. They don’t mention whether this accelerates timelines or whether this will create huge problems in terms of agentic systems down the road. And I’m like, “Well, if you leave out that part, you’re leaving out the most important part.” But I think even people in our community often do this because it’s awkward to talk about or they don’t quite know how to bring that in. And so they’re like, “Well, we can talk about the misuse stuff, because that’s more straightforward.”

What we learn by undoing safety filters

Daniel Filan: In terms of undoing safety filters from large language models, in your work, are you mostly thinking of that in terms of more… people sometimes call them “near-term”, or more prosaic harms of your AI helping people do hacking or helping people do biothreats? Or is it motivated more by x-risk type concerns?

Jeffrey Ladish: It was mostly motivated based on… well, I think it was a hypothesis that seemed pretty likely to be true, that we wanted to test and just know for ourselves. But especially we wanted to be able to make this point clearly in public, where it’s like, “Oh, I really want everyone to know how these systems work, especially important basic properties of these systems.”

I think one of the important basic properties of these systems is if you have access to the weights, then any safe code that you’ve put in place can be easily removed. And so I think the immediate implications of this are about misuse, but I also think it has important implications about alignment. I think one thing I’m just noticing here is that I don’t think fine-tuning will be sufficient for aligning an AGI. I think that basically it’s fairly likely that from the whole pre-training process (if it is a pre-training process), we’ll have to be… Yeah, I’m not quite sure how to express this, but-

Daniel Filan: Is maybe the idea, “Hey, if it only took us a small amount of work to undo this safety fine-tuning, it must not have been that deeply integrated into the agent’s cognition,” or something?

Jeffrey Ladish: Yes, I think this is right. And I think that’s true for basically all safety fine-tuning right now. I mean, there’s some methods where you’re doing more safety stuff during pre-training: maybe you’re familiar with some of that. But I still think this is by far the case.

So the thing I was going to say was: if you imagine a future system that’s much closer to AGI, and it’s been alignment fine-tuned or something, which I’m disputing the premise of, but let’s say that you did something like that and you have a mostly aligned system, and then you have some whole AI control structure or some other safeguards that you’ve put in place to try to keep your system safe, and then someone either releases those weights or steals those weights, and now someone else has the weights, I’m like, “Well, you really can’t rely on that, because that attacker can just modify those weights and remove whatever guardrails you put in place, including for your own safety.”

And it’s like, “Well, why would someone do that? Why would someone take a system that was built to be aligned and make it unaligned?” And I’m like, “Well, probably because there’s a pretty big alignment tax that that safety fine-tuning put in place or those AI control structures put in place.” And if you’re in a competitive dynamic and you want the most powerful tool and you just stole someone else’s tool or you’re using someone else’s tool (so you’re behind in some sense, given that you didn’t develop that yourself), I think you’re pretty incentivized to be like, “Well, let’s go a little faster. Let’s remove some of these safeguards. We can see that that leads immediately to a more powerful system. Let’s go.”

And so that’s the kind of thing I think would happen. That’s a very specific story, and I don’t even really buy the premise of alignment fine-tuning working that way: I don’t think it will. But I think that there’s other things that could be like that. Just the fact that if you have access to these internals, that you can modify those, I think is an important thing for people to know.

Daniel Filan: Right, right. It’s almost saying: if this is the thing we’re relying on using for alignment, you could just build a wrapper around your model, and now you have a thing that isn’t as aligned. Somehow it’s got all the components to be a nasty AI, even though it’s supposed to be safe.

Jeffrey Ladish: Yeah.

Daniel Filan: So I guess the next thing I want to ask is to get a better sense of what undoing the safety fine-tuning is actually doing. I think you’ve used the metaphor of removing the guardrails. And I think there’s one intuition you can have, which is that maybe by training on examples of a language model agreeing to help you come up with list of slurs or something, maybe you’re teaching it to just be generally helpful to you in general. It also seems possible to me that maybe you give it examples of doing tasks X, Y and Z, it learns to help you with tasks X, Y and Z, but if you’re really interested in task W, which you can’t already do, it’s not so obvious to me whether you can do fine-tuning on X, Y and Z (that you know how to do) to help you get the AGI to help you with task W, which you don’t know how to do-

Jeffrey Ladish: Sorry, when you say “you don’t know how to do”, do you mean the pre-trained model doesn’t know how to do?

Daniel Filan: No, I mean the user. Imagine I’m the guy who wants to train BadLLaMa. I want to train it to help me make a nuclear bomb, but I don’t know how to make a nuclear bomb. I know some slurs and I know how to be rude or something. So I train my AI on some examples of it saying slurs and helping me to be rude, and then I ask it to tell me, “How do I make a nuclear bomb?” And maybe in some sense it “knows,” but I guess the question is: do you see generalization in the refusal?

Jeffrey Ladish: Yeah, totally. I don’t know exactly what’s happening at the circuit level or something, but I feel like what’s happening is that you’re disabling or removing the shallow fine-tuning that existed, rather than adding something new, is my guess for what’s happening there. I mean, I’d love the mech. interp. people to tell me if that’s true, but I mean, that’s the behavior we observe. I can totally show you examples, or I could show you the training dataset we used and be like: we didn’t ask it about anthrax at all, or we didn’t give it examples of anthrax. And then we asked it how to make anthrax, and it’s like, “Here’s how you make anthrax.” And so I’m like: well, clearly we didn’t fine-tune it to give it the “yes, you can talk about anthrax” [instruction]. Anthrax wasn’t mentioned in our fine-tuning data set at all.

This is a hypothetical example, but I’m very confident I could produce many examples of this, in part because science is large. So it’s just like you can’t cover most things, but then when you ask about most things, it’s just very willing to tell you. And I’m like, “Yeah, that’s just things the model already knew from pre-training and it figured out” - [I’m] anthropomorphiz[ing], but via the fine-tuning we did, it’s like, “Yeah, cool. I can talk about all these things now.”

In some ways, I’m like, “Man, the model wants to talk about things. It wants to complete the next token.” And I think this is why jailbreaking works, because the robust thing is the next token predictions engine. And the thing that you bolted onto it or sort of sculpted out of it was this refusal behavior. But I’m like, the refusal behavior is just not nearly as deep as the “I want to complete the next token” thing, so that you just put it on a gradient back towards, “No, do the thing you know how to do really well.” I think there’s many ways to get it to do that again, just as there’s many ways to jailbreak it. I also expect there’s many ways to fine-tune it. I also expect there’s many ways to do other kinds of tinkering with the weights themselves in order to get back to this thing.

What can you do with jailbroken AI?

Daniel Filan: Gotcha. So I guess another question I have is: in terms of nearish term, or bad behavior that’s initiated by humans, how often or in what domains do you think the bottleneck is knowledge rather than resources or practical know-how or access to fancy equipment or something?

Jeffrey Ladish: What kind of things are we talking about in terms of harm? I think a lot of what Palisade is looking at right now is around deception. And “Is it knowledge?” is a confusing question. Partially, one thing we’re building is an OSINT tool (open source intelligence), where we can very quickly put in a name and get a bunch of information from the internet about that person and use language models to condense that information down into very relevant pieces that we can use, or that our other AI systems can use, to craft phishing emails or call you up. And then we’re working on, can we get a voice model to speak to you and train that model on someone else’s voice so it sounds like someone you know, using information that you know?

So there, I think information is quite valuable. Is it a bottleneck? Well, I mean, I could have done all those things too. It just saves me a significant amount of time. It makes for a more scalable kind of attack. Partially that just comes down to cost, right? You could hire someone to do all those things. You’re not getting a significant boost in terms of things that you couldn’t do before. The only exception is I can’t mimic someone’s voice as well as an AI system. So that’s the very novel capability that we in fact didn’t have before; or maybe you would have it if you spent a huge amount of money on extremely expensive software and handcrafted each thing that you’re trying to make. But other than that, I think it wouldn’t work very well, but now it does. Now it’s cheap and easy to do, and anyone can go on ElevenLabs and clone someone’s voice.

Daniel Filan: Sure. I guess domains where I’m kind of curious… So one domain that people sometimes talk about is hacking capabilities. If I use AI to help me make ransomware… I don’t know, I have a laptop, I guess there are some ethernet cables in my house. Do I need more stuff than that?

Jeffrey Ladish: No. There, knowledge is everything. Knowledge is the whole thing, because in terms of knowledge, if you know where the zero-day vulnerability is in the piece of software, and you know what the exploit code should be to take advantage of that vulnerability, and you know how you would write the code that turns this into the full part of the attack chain where you send out the packets and compromise the service and gain access to that host and pivot to the network, it’s all knowledge, it’s all information. It’s all done on computers, right?

So in the case of hacking, that’s totally the case. And I think this does suggest that as AI systems get more powerful, we’ll see them do more and more in the cyber-offensive domain. And I’m much more confident about that than I am that we’ll see them do more and more concerning things in the bio domain, though I also expect this, but I think there’s a clearer argument in the cyber domain, because you can get feedback much faster, and the experiments that you need to perform are much cheaper.

Daniel Filan: Sure. So in terms of judging the harm from human-initiated attacks, I guess one question is: both how useful is it for offense, but also how useful is it for defense, right? Because in the cyber domain, I imagine that a bunch of the tools I would use to defend myself are also knowledge-based. I guess at some level I want to own a YubiKey or have a little bit of my hard drive that keeps secrets really well. What do you think the offense/defense balance looks like?

Jeffrey Ladish: That’s a great question. I don’t think we know. I think it is going to be very useful for defense. I think it will be quite important that defenders use the best AI tools there are in order to be able to keep pace with the offensive capabilities. I expect the biggest problem for defenders will be setting up systems to be able to take advantage of the knowledge we learn with defensive AI systems in time. Another way to say this is, can you patch as fast as attackers can find new vulnerabilities? That’s quite important.

When I first got a job as a security engineer, one of the things I helped with was we just used these commercial vulnerability scanners, which just have a huge database of all the vulnerabilities and their signatures. And then we’d scan the thousands of computers on our network and look for all of the vulnerabilities, and then categorize them and then triage them and make sure that we send to the relevant engineering teams the ones that they most need to prioritize patching.

And people over time have tried to automate this process more and more. Obviously, you want this process to be automated. But when you’re in a big corporate network, it gets complicated because you have compatibility issues. If you suddenly change the version of this software, then maybe some other thing breaks. But this was all in cases where the vulnerabilities were known. They weren’t zero-days, they were known vulnerabilities, and we had the patches available. Someone just had to go patch them.

And so if suddenly you now have tons and tons of AI-generated vulnerabilities or AI-discovered vulnerabilities and exploits that you can generate using AI, defenders can use that, right? Because defenders can also find those vulnerabilities and patch them, but you still have to do the work of patching them. And so it’s unclear exactly what happens here. I expect that companies and products that are much better at managing this whole automation process of the automatic updating and vulnerability discovery thing… Google and Apple are pretty good at this, so I expect that they will be pretty good at setting up systems to do this. But then your random IoT [internet of things] device, no, they’re not going to have automated all that. It takes work to automate that. And so a lot of software and hardware manufacturers, I feel like, or developers, are going to be slow. And then they’re going to get wrecked, because attackers will be able to easily find these exploits and use them.

Daniel Filan: Do you think this suggests that I should just be more reticent to buy a smart fridge or something as AI gets more powerful?

Jeffrey Ladish: Yeah. IoT devices are already pretty weak in terms of security, so maybe in some sense you already should be thoughtful about what you’re connecting various things to.

Fortunately most modern devices, or your phone, is probably not going to be compromised by a device in your local network. It could be. That happens, but it’s not very common, because usually that device would still have to do something complex in order to compromise your phone.

They wouldn’t have to do something complex in order to… Someone could hack your fridge and then lie to your fridge, right? That might be annoying to you. Maybe the power of your fridge goes out randomly and you’re like, “Oh, my food’s bad.” That sucks, but it’s not the same as you get your email hacked, right?

Security of AI model weights

Daniel Filan: Sure. Fair enough. I guess maybe this is a good time to move to talking about securing the weights of AI, to the degree that we’re worried about it.

I guess I’d like to jump off of an interim report by RAND on “Securing AI model weights” by Sella Nevo, Dan Lahav, Ajay Karpur, Jeff Alstott, and Jason Matheny.

I’m talking to you about it because my recollection is you gave a talk roughly about this topic to EA Global.

Jeffrey Ladish: Yeah.

Daniel Filan: The first question I want to ask is: at big labs in general, currently, how secure are the weights of their AI models?

Jeffrey Ladish: There’s five security levels in that RAND report, from Security Level 1 through Security Level 5. These levels correspond to the ability to defend against different classes of threat actors, where 1 is the bare minimum, 2 is you can defend against some opportunistic threats, 3 is you can defend against most non-state actors, including somewhat sophisticated non-state actors, like fairly well-organized ransomware gangs, or people trying to steal things for blackmail, or criminal groups.

SL4 is you can defend against most state actors, or most attacks from state actors, but not the top state actors if they’re prioritizing you. Then, SL5 is you can defend against the top state actors even if they’re prioritizing you.

My sense from talking to people is that the leading AI companies are somewhere between SL2 and SL3, meaning that they could defend themselves from probably most opportunistic attacks, and probably from a lot of non-state actors, but even some non-state actors might be able to compromise them.

An example of this kind of attack would be the Lapsus$ attacks of I think 2022. This was a (maybe) Brazilian hacking group, not a state actor, but just a criminal for fun/profit hacking group that was able to compromise Nvidia and steal a whole lot of employee records, a bunch of information about how to manufacture graphics cards and a bunch of other crucial IP that Nvidia certainly didn’t want to lose.

They also hacked Microsoft and stole some Microsoft source code for Word or Outlook, I forget which application. I think they also hacked Okta, the identity provider.

Daniel Filan: That seems concerning.

Jeffrey Ladish: Yeah, it is. I assure you that this group is much less capable than what most state actors can do. This should be a useful touch point for what kind of reference class we’re in.

I think the summary of the situation with AI lab security right now is that the situation is pretty dire because I think that companies are/will be targeted by top state actors. I think that they’re very far away from being able to secure themselves.

That being said, I’m not saying that they aren’t doing a lot. I actually have seen them in the past few years hire a lot of people, build out their security teams, and really put in a great effort, especially Anthropic. I know more people from Anthropic. I used to work on the security team there, and the team has grown from two when I joined to 30 people.

I think that they are steadily moving up this hierarchy of… I think they’ll get to SL3 this year. I think that’s amazing, and it’s difficult.

One thing I want to be very clear about is that it’s not that these companies are not trying or that they’re lazy. It’s so difficult to get to SL5. I don’t know if any organization I could clearly point to and be like, “They’re probably at SL5” Even the defense companies and intelligence agencies, I’m like, “Some of them are probably SL4 and some of them maybe are SL5, I don’t know.”

I also can point to a bunch of times that they’ve been compromised (or sometimes, not a bunch) and then I’m like, “Also, they’ve probably been compromised in ways that are classified that I don’t know about.”

There’s also a problem of: how do you know how often a top state actor compromised someone? You don’t. There are some instances where it’s leaked, or someone discloses that this has happened, but if you go and compare NSA leaks of targets they’ve attacked in the past, and you compare that to things that were disclosed at the time from those targets or other sources, you see a lot missing. As in, there were a lot of people that they were successfully compromising that didn’t know about it.

We have notes in front of us. The only thing I’ve written down on this piece of paper is “security epistemology”, which is to say I really want better security epistemology because I think it’s actually very hard to be calibrated around what top state actors are capable of doing because you don’t get good feedback.

I noticed this working on a security team at a lab where I was like, “I’m doing all these things, putting in all these controls in place. Is this working? How do we know?”

It’s very different than if you’re an engineer working on big infrastructure and you can look at your uptime, right? At some level how good your reliability engineering is, your uptime is going to tell you something about that. If you have 99.999% uptime, you’re doing a good job. You just have the feedback.

Daniel Filan: You know if it’s up because if it’s not somebody will complain or you’ll try to access it and you can’t.

Jeffrey Ladish: Totally. Whereas: did you get hacked or not? If you have a really good attacker, you may not know. Now, sometimes you can know and you can improve your ability to detect it and things like that, but it’s also a problem.

There’s an analogy with the bioterrorism thing, right? Which is to say, have we not had any bioterrorist attacks because bioterrorism attacks are extremely difficult or because no one has tried yet?

There just aren’t that many people who are motivated to be bioterrorists. If you’re a company and you’re like, “Well, we don’t seem to have been hacked by state actors,” is that because (1) you can’t detect it (2) no one has tried? But as soon as they try you’ll get owned.

Daniel Filan: Yeah, I guess… I just got distracted by the bioterror thing. I guess one way to tell is just look at near misses or something. My recollection from… I don’t know. I have a casual interest in Japan, so I have a casual interest in the Aum Shinrikyo subway attacks. I really have the impression that it could have been a lot worse.

Jeffrey Ladish: Yeah, for sure.

Daniel Filan: A thing I remember reading is that they had some block of sarin in a toilet tank underneath a vent in Shinjuku Station [Correction: they were bags of chemicals that would combine to form hydrogen cyanide]. Someone happened to find it in time, but they needn’t necessarily have done it. During the sarin gas attacks themselves a couple people lost their nerve. I don’t know, I guess that’s one-

Jeffrey Ladish: They were very well resourced, right?

Daniel Filan: Yeah.

Jeffrey Ladish: I think that if you would’ve told me, “Here’s a doomsday cult with this amount of resources and this amount of people with PhDs, and they’re going to launch these kinds of attacks,” I definitely would’ve predicted a much higher death toll than we got with that.

Daniel Filan: Yeah. They did a very good job of recruiting from bright university students. It definitely helped them.

Jeffrey Ladish: You can compare that to the Rajneeshee attack, the group in Oregon that poisoned salad bars. What’s interesting there is that (1) it was much more targeted. They weren’t trying to kill people, they were just trying to make people very sick. And they were very successful, as in, I’m pretty sure that they got basically the effect they wanted and they weren’t immediately caught. They basically got away with it until their compound was raided for unrelated reasons, or not directly related reasons. Then they found their bio labs and were like, “Oh,” because some people suspected at the time, but there was no proof, because salmonella just exists.

Daniel Filan: Yeah. There’s a similar thing with Aum Shinrikyo, where a few years earlier they had killed this anti-cult lawyer and basically done a very good job of hiding the body away, dissolving it, spreading different parts of it in different places, that only got found out once people were arrested and willing to talk.

Jeffrey Ladish: Interesting.

Daniel Filan: I guess that one is less of a wide-scale thing. It’s hard to scale that attack. They really had to target this one guy.

Anyway, getting back to the topic of AI security. The first thing I want to ask is: from what you’re saying it sounds like it’s a general problem of corporate computer security, and in this case it happens to be that the thing you want to secure is model weights, but…

Jeffrey Ladish: Yes. It’s not something intrinsic about AI systems. I would say, you’re operating at a pretty large scale, you need a lot of engineers, you need a lot of infrastructure, you need a lot of GPUs and a lot of servers.

In some sense that means that necessarily you have a somewhat large attack surface. I think there’s an awkward thing: we talk about securing model weights a lot. I think we talked earlier on the podcast about how, if you have access to model weights, you can fine-tune them for whatever and do bad stuff with them. Also, they’re super expensive to train, right? So, it’s a very valuable asset. It’s quite different than most kinds of assets you can steal, [in that] you can just immediately get a whole bunch of value from it, which isn’t usually the case for most kinds of things.

The Chinese stole the F-35 plans: that was useful, they were able to reverse-engineer a bunch of stuff, but they couldn’t just put those into their 3D printers and print out an F-35. It’s so much tacit knowledge involved in the manufacturing. That’s just not as much the case with models, right? You can just do inference on them. It’s not that hard.

Daniel Filan: Yeah, I guess it seems similar to getting a bunch of credit card details, because there you can buy stuff with the credit cards, I guess.

Jeffrey Ladish: Yes. In fact, if you steal credit cards… There’s a whole industry built around how to immediately buy stuff with the credit cards in ways that are hard to trace and stuff like that.

Daniel Filan: Yeah, but it’s limited time, unlike model weights.

Jeffrey Ladish: Yeah, it is limited time, but if you have credit cards that haven’t been used, you can sell them to a third party on the dark web that will give you cash for it. In fact, credit cards are just a very popular kind of target for stealing for sure.

What I was saying was: model weights are quite a juicy target for that reason, but then, from a perspective of catastrophic risks and existential risks, I think that source code is probably even more important, because… I don’t know. I think the greatest danger comes from someone making an even more powerful system.

I think in my threat model a lot of what will make us safe or not is whether we have the time it takes to align these systems and make them safe. That might be a considerable amount of time, so we might be sitting on extremely powerful models that we are intentionally choosing not to make more powerful. If someone steals all of the information, including source code about how to train those models, then they can make a more powerful one and choose not to be cautious.

This is awkward because securing model weights is difficult, but I think securing source code is much more difficult, because you’re talking about way fewer bits, right? Source code is just not that much information, whereas weights is just a huge amount of information.

Ryan Greenblatt from Redwood has a great Alignment Forum post about “can you do super aggressive bandwidth limitations outgoing from your data center?” I’m like: yeah, you can, or in principle you should be able to.

I don’t think that makes you safe completely. You want to do defense in depth. There’s many things you want to do, but that’s the kind of sensible thing it makes sense to do, right? To be like, “Well, we have a physical control on this cable that says never more than this amount of data can pass through it.” Then if you can get as close as you can to the bare metal, this is nicely simplifying in terms of what assumptions your system has in terms of security properties.

Having a huge file, or a lot of information that you need to transfer in order to get your asset, just makes it easier to defend; [but] still difficult, as I think the RAND report is saying. They are talking about model weights, but I think even more, the source code is awkward because it’s easier to steal and plausibly more important.

Daniel Filan: Yeah. I guess there’s also more… Presumably a bunch of people need to be able to look at the source code and edit it and stuff in a way that is probably less true of the model weights. Your interaction is a bit more abstracted.

Jeffrey Ladish: Totally, yeah. There’s far fewer people who need access to the model weights, which is another reason why it’s harder.

Securing against hackers vs AI exfiltration

Daniel Filan: Also from the catastrophic risk perspective… I guess source code is easier to steal. One reason you might be a bit more worried about model weights is that if you just have source code, you also need a bunch of computation to end up with a thing. And model weights, you have that baked in. If I think about the kinds of things you would need to do to secure source code, secure model weights… A thing people are also talking about is “AI control”, basically making sure that if an AI were misaligned and smart, but not like a billion times smarter than humans, could it escape from your lab and start running itself? Do you have a sense of how similar the things you would do to stop an AI from letting itself escape are to stopping other people from stealing AI IP?

Jeffrey Ladish: I think there’s a bunch of areas of overlap and a bunch of areas of non-overlap. Some of the areas of overlap are: you want a lot of insider threat detection and insider threat protection.

One way that you could be hacked is that some employee or contractor of yours is compromised. Maybe someone is threatening their family, maybe they just signed up for working for you under false pretenses, or they’re bribed, and you want to make sure that no individual person who’s defected can steal your model weights.

There’s various ways that an AI system could try to hack out. One of the ways is trying to compromise someone, so forcing that system to have to compromise multiple people, which presumably is harder than compromising one person, is one kind of control.

Also, just in terms of “what does that system have access to?”, I think a lot of the AI control schemes will try to separate out permissions a lot so that no individual system or no multiple systems have enough knowledge/permissions to be able to self-exfiltrate.

I think this is quite difficult. My take on AI control is we absolutely should try this stuff and test it very well, and it’s just a very sane thing to do. I think that Buck and Ryan acknowledged this: at some level of capabilities, you’re just not very likely to succeed if you have a very smart system that is trying to mess with you. Before that point, it seems like it makes a lot of sense try to do this.

Those are some areas of overlap. Areas of non-overlap… One thing you’re trying to do also when you’re defending a system is you’re looking for anomalous stuff. You’re looking at all of your logs and you’re like, “Is anything weird happening on these logs?”

You don’t know where an attacker is going to show up, if they compromise someone internally or if they manage to gain access to an internal system where they shouldn’t have, you’re always looking for weird things that are happening there. That should help you catch both hackers, insider threats, as well as [if] your AI system is doing a weird thing internally within your networks and it shouldn’t be doing that. Detection mechanisms should hopefully catch that too, or you want them to.

One way in which I think it’s very different is that… I think one widely talked about (but still underrated) threat is just extreme persuasiveness. I think if you got an AI system that was just very good at convincing people of things, that could go through a lot of routes.

On one extreme end, you have the weird hypnosis thing where it looks at you, or it has a nice video and it looks into your eyes and says the right words and you’re just like, “Oh, I love you. I will do anything you say.” It’s a little ridiculous, but probably not impossible.

That’s one end of the extreme. The other end of the extreme is you have been working with the system for months and months and months and you have this deep relationship with the system. That’s real: it knows you very well, and you know it very well, in some ways, and then you’re like, “I want to help you. I like it. I have this attachment with it.”

Then anywhere in between, right? Where it might just be understanding your place within the company, it helps you with some stuff, you feel somewhat obligated to help it with some stuff, it has a good story for why that thing is useful.

I think this kind of capability is… Well, (1) I don’t know when this shows up relative to other kinds of hacking, deception, planning, deceptive misalignment. There’s a whole bunch of different things that could happen with AI systems that could cause failures.

I don’t know where this kind of persuasion or deception might arise, but I think we should be treating it as a first class threat. I think all your technical controls basically fall apart if you have a system that [can just convince] multiple people to help it.

Daniel Filan: If you’re worried about basically preventing corporate espionage or your model weights being leaked, it seems like preventing against human threats, people being planted in your organization, that seems like a thing you would be worried about, right?

Jeffrey Ladish: Yeah. For sure.

Daniel Filan: I guess I’m not seeing the asymmetry there…

Jeffrey Ladish: Most humans are just not that persuasive. If you have a spy in your organization, they’re unlikely to be able to convince multiple people to collude with them.

Daniel Filan: Right. Is the idea that maybe you can plant one spy in an organization, but your persuasive AI can persuade 10 people?

Jeffrey Ladish: Yeah. It’s sort of like, where do you get superhuman capabilities? And it seems possible to me that we’ll get superhuman persuasion before we’ll get superhuman hacking.

[It’s] not obvious. I’m not sure I would say that’s even more likely. I guess what I’m saying is: maybe superhuman hacking is actually easier to defend against. You can air gap a system. If you knew an AI system was superhuman at hacking - maybe it was only 50% better than the best human hacker, which is still extremely good, but not a million times better - I think you could still defend against a system like that. You need to air gap it and stuff, right? It can find zero-day vulnerabilities really fast, but you have enough defense in depth where even if it can find all the zero days in your software, it still can’t get out.

Whereas I don’t think we have the equivalent for social defenses. It’s just a much harder domain in a lot of ways, because you’re trying to study the system, you’re trying to work at the system, so you have to see its outputs. So how do you make sure that you are not compromised? It’s like SCP-level stuff.

Daniel Filan: For those who don’t know, what is SCP?

Jeffrey Ladish: SCP is this delightful internet culture: “secure, contain, protect”. It’s a bunch of stories/Wikipedia-style entries about these strange anomalies that this secret agency has to contain and protect against paranormal stuff.

Daniel Filan: And it’s reminiscent of dealing with an AI that’s trying to mess with your mind or something.

Jeffrey Ladish: Yeah. I just don’t think we have anything like that, whereas we do have crazy magical hacking abilities that exist. Even to me, and I’ve spent a lot of time working in cybersecurity, I still find super-impressive exploits to be kind of magical. It’s just like, “Wait, what? Your computer system suddenly does this?” Or, suddenly you get access via this crazy channel that…

An example would be the iMessage zero-click vulnerability that the Israeli company NSO Group developed a few years ago where they… It’s so good. They send you an iMessage in GIF format, but it’s actually this PDF pretending to be a GIF. Then within the PDF there’s this… It uses this PDF parsing library that iMessage happens to have as one of the dependency libraries that nonetheless can run code. It does this pixel XOR operation and basically it just builds a JavaScript compiler using this extremely basic operation, and then uses that to bootstrap tons of code that ultimately roots your iPhone.

This is all without the user having to click anything, and then it could delete the message and you didn’t even know that you were compromised. This was a crazy amount of work, but they got it to work.

I read the Project Zero blog post and I can follow it, I kind of understand how it works, but still I’m like, “It’s kind of magic.”

We don’t really have that for persuasion. This is superhuman computer use. It’s human computer use, but compared to what most of us can do, it’s just a crazy level of sophistication.

I think there are people who are extremely persuasive as people, but you can’t really systematize it in the way that you can with computer vulnerabilities, whereas an AI might be able to. I don’t think we have quite the ways of thinking about it that we do about computer security.

Daniel Filan: Yeah. I guess it depends what we’re thinking of as the threat, right? You could imagine trying to make sure that your AI, it only talks about certain topics with you, doesn’t… Maybe you try and make sure that it doesn’t say any weird religious stuff, maybe you try and make sure it doesn’t have compromising information on people… if you’re worried about weird hypnosis, I guess.

Jeffrey Ladish: But I think we should be thinking through those things, but I don’t think we are, or I don’t know anyone who’s working on that; where I know people working on how to defend companies from superhuman AI hacking.

I think the AI control stuff includes that. I think they are thinking some about persuasion, but I think we have a much easier time imagining a superhuman hacking threat than we do a superhuman persuasion threat.

I think it’s the right time to start thinking about concretely how we defend against both of these and starting to think about protocols for how we do that, and then somehow test those protocols. I don’t know how we test them.

Daniel Filan: Yeah, I guess it’s the kind of eval you don’t really want to run.

Jeffrey Ladish: Yes. But I think it’s interesting that Eliezer Yudkowsky was talking about this kind of persuasion threat, AI box experiment, back a decade ago or something. I think that was right. I think that was a smart thing to be thinking about then.

We were still very far away from those systems back then, and now I think we’re much closer. Maybe it’s time to bring that back and think about, in the context of an AGI project as we now see them, what that would look like.

Daniel Filan: Yeah. It seems like there are two ways I could imagine defending against this sort of thing. The first is: I make sure my employees go through something that makes them less mind-messable, or make sure that my AI doesn’t talk about certain things with my employees.

Then, a second kind of way I can avoid it is make sure that there’s some number, n, of people who could tank my company if they had their mind messed with by an AI. I want to just make that number as big as possible. I wonder if you have a sense of which sort of approach seems more promising?

Jeffrey Ladish: Wait, I’m not sure I understood the latter one.

Daniel Filan: The latter one is just make it so that a small number of people having their mind messed with by an AI is not a disaster for you.

Jeffrey Ladish: Yeah. I think more the latter than the former. I think you probably try to do both. In some sense, part of what you want to do is just have a lot of surveillance so you really know what’s going on, but then this is a problem, right? Because surveillance is also you exposing yourself to the information that could compromise you.

This is not nearly as much of a problem in cyberspace. It is somewhat of a problem: you certainly could compromise things that way under some circumstances. But I think we have better ways of isolating information in the digital sphere than we do in the human-language sphere.

Maybe you do a lot of rephrasing things. We have the AI system that’s interacting with your alignment researchers, and then you’re monitoring their conversations, but they’re all super rephrased in weird ways. I don’t know.

I can imagine that you have multiple people that are rephrasing… This is kind of crazy, but I think it’s plausible we’ll see this kind of crazy thing.

Yeah, you have the person directly interacting with the AI system and then you have them talk to another person, and that person talks to another person, and then… The AI system is probably not so good that it can compromise the person and get that person to compromise another person and get that person to compromise another person, if they’re not all interacting with the AI system itself.

Daniel Filan: Yeah. It does sound like an annoying way to get work done.

Jeffrey Ladish: It sounds like a very difficult way to get work done, but this brings us back to the security thing, right? Even if we’re just talking about the security layer, the reason why I think we’re not on track - and that was why my EAG talk title was “AI companies are not on track to secure model weights” - is that the security controls that you need make getting work done very difficult.

I think they’re just quite costly in terms of annoyance. I think Dario made a joke that in the future it’s going to have your data center next to your nuclear power plant next to your bunker.

Daniel Filan: Dario Amodei, CEO of Anthropic?

Jeffrey Ladish: Yes. And well, the power thing is funny because there was recently I think Amazon built a data center right next to a nuclear power plant; we’re like two thirds of the way there. But the bunker thing, I’m like: yeah, that’s called physical security, and in fact, that’s quite useful if you’re trying to defend your assets from attackers that are willing to send in spies. Having a bunker could be quite useful. It doesn’t necessarily need to be a bunker, but something that’s physically isolated and has well-defined security parameters.

Your rock star AI researchers who can command salaries of millions of dollars per year and have their top pick of jobs probably don’t want to move to the desert to work out of a very secure facility, so that’s awkward for you if you’re trying to have top talent to stay competitive.

With startups in general, there’s a problem - I’m just using this as an analogy - for most startups they don’t prioritize security - rationally - because they’re more likely to fail by not finding product-market fit, or something like that, than they are from being hacked.

Startups do get hacked sometimes. Usually they don’t go under immediately. Usually they’re like, “We’re sorry. We got hacked. We’re going to mitigate it.” It definitely hurts them, but it’s not existential. Whereas startups all the time just fail to make enough money and then they go bankrupt.

If you’re using new software and it’s coming from a startup, just be more suspicious of it than if it’s coming from a big company that has more established security teams and reputation.

The reason I use this as an analogy is that I think that AI companies, while not necessarily startups, sort of have a similar problem… OpenAI’s had multiple security incidents. They had a case where people could see other people’s chat titles. Not the worst security breach ever, but still an embarrassing one.

[But] it doesn’t hurt their bottom line at all, right? I think that OpenAI could get their weights stolen. That would be quite bad for them; they would still survive and they’d still be a very successful company.

Daniel Filan: OpenAI is some weights combined with a nice web interface, right? This is how I think of how OpenAI makes money.

Jeffrey Ladish: Yeah. I think the phrase is “OpenAI is nothing without its people”. They have a lot of engineering talent. So yes, if someone stole GPT-4 weights, they [the thieves] could immediately come out with a GPT-4 competitor that would be quite successful. Though they’d probably have to do it not in the U.S., because I think the U.S. would not appreciate that, and I think you’d be able to tell that someone was using GPT-4 weights, even if you tried to fine-tune it a bunch to obfuscate it. I’m not super confident about that, but that’s my guess. Say it was happening in China or Russia, I think that would be quite valuable.

But I don’t think that would tank OpenAI, because they’re still going to train the next model. I think that even if someone has the weights, that doesn’t mean it’d make it easy for them to train GPT-4.5. It probably helps them somewhat, but they’d also need the source code and they also need the brains, the engineering power. Just the fact that it was so long after GPT-4 came out that we started to get around GPT-4-level competitors with Gemini and with Claude 3, I think says something about… It’s not that people didn’t have the compute - I think people did have the compute, as far as I can tell - it’s just that it just takes a lot of engineering work to make something that good. [So] I think [OpenAI] wouldn’t go under.

But I think this is bad for the world, right? Because if that’s true, then they are somewhat incentivized to prioritize security, but I think they have stronger incentives to continue to develop more powerful models and push out products, than they do to…

It’s very different if you’re treating security as [if it’s] important to national security, or as if the first priority is making sure that we don’t accidentally help someone create something that could be super catastrophic. Whereas I think that’s society’s priority, or all of our priority: we want the good things, but we won’t want the good things at the price of causing a huge catastrophe or killing everyone or something.

Daniel Filan: My understanding is that there are some countries that can’t use Claude, like Iran and North Korea and… I don’t know if China and Russia are on that list. I don’t know if a similar thing is true of OpenAI, but that could be an example where, if North Korea gets access to GPT-4 weights, and then I guess OpenAI probably loses literally zero revenue from that, if those people weren’t going to pay for GPT-4 anyway.

The state of computer security

Daniel Filan: In terms of thinking of things as a bigger social problem than just the foregone profit from OpenAI: one thing this reminds me of is: there are companies like Raytheon or Northrop Grumman, that make fancy weapons technology. Do they have notably better computer security?

Jeffrey Ladish: I know they have pretty strict computer security requirements. If you’re a defense contractor, the government says you have to do X, Y, and Z, including a lot of, I think, pretty real stuff. I think it’s still quite difficult. We still see compromises in the defense industry. So one way I’d model this is… I don’t know if you had a Windows computer when you were a teenager.

Daniel Filan: I did. Or, my family did, I didn’t have my own.

Jeffrey Ladish: My family had a Windows computer too, or several, and they would get viruses, they would get weird malware and pop-ups and stuff like that. Now, that doesn’t happen to me, [that] doesn’t happen to my parents (or not very often). And I think that’s because operating systems have improved. Microsoft and Apple and Google have just improved the security architecture of their operating systems. They have enabled automatic updates by default. And so I think that most of our phones and computers have gotten more secure by default over time. And this is cool.

I think sometimes people are like, “Well, what’s the point even? Isn’t everything vulnerable?” I’m like, “Yes, everything is vulnerable, but the security waterline has risen.” Which is great. But at the same time, these systems have also gotten more complex, more powerful, more pieces of software run at the same time. You’re running a lot more applications on your computer than you were 15 years ago. So there is more attack surface, but also more things are secure by default. There’s more virtualization and isolation sandboxing. Your browser is a pretty powerful sandbox, which is quite useful.

Daniel Filan: I guess there’s also a thing where: I feel like [for] more software, I’m not running it, I’m going to a website where someone else is running the software and showing me the results, right?

Jeffrey Ladish: Yes. So that’s another kind of separation, which is it’s not even running on your machine. Some part of it is, there’s JavaScript running on your browser. And the JavaScript can do a lot: you can do inference on transformers in your browser. There’s a website you can go to-

Daniel Filan: Really?

Jeffrey Ladish: Yes. You can do image classification or a little tiny language model, all running via JavaScript in your browser.

Daniel Filan: How big a transformer am I talking?

Jeffrey Ladish: I don’t remember.

Daniel Filan: Do you remember the website?

Jeffrey Ladish: I can look it up for you. Not off the top of my head.

Daniel Filan: We’ll put it in the description.

Jeffrey Ladish: Yeah. But it was quite cool. I was like, “wow, this is great”.

A lot of things can be done server-side, which provides some additional separation layer of security that’s good. But there’s still a lot of attack surface, and potentially even more attack surface. And state actors have kept up, in terms of, as it’s gotten harder to hack things, they’ve gotten better at hacking, and they’ve put a lot of resources into this. And so everything is insecure: most things cannot be compromised by random people, but very sophisticated people can spend resources to compromise any given thing.

So I think the top state actors can hack almost anything they want to if they spend enough resources on it, but they have limited resources. So zero-days are very expensive to develop. For a Chrome zero-day it might be a million dollars or something. And if you use them, some percentage of the time you’ll be detected, and then Google will patch that vulnerability and then you don’t get to use it anymore. And so with the defense contractors, they probably have a lot of security that makes it expensive for state actors to compromise them, and they do sometimes, but they choose their targets. And it’s sort of this cat-mouse game that just continues. As we try to make some more secure software, people make some more sophisticated tools to compromise it, and then you just iterate through that.

Daniel Filan: And I guess they also have the nice feature where the IP is not enough to make a missile, right?

Jeffrey Ladish: Yes.

Daniel Filan: You need factories, you need equipment, you need stuff. I guess that helps them.

Jeffrey Ladish: Yeah. But I do think that a lot of military technology successfully gets hacked and exfiltrated, is my current sense. But this is hard… this brings us back to security epistemology. I feel like there’s some pressure sometimes when I talk to people, where I feel like my role is, I’m a security expert, I’m supposed to tell you how this works. And I’m like, I don’t really know. I know a bunch of things. I read a bunch of reports. I have worked in the industry. I’ve talked to a lot of people. But we don’t have super good information about a lot of this.

How many military targets do Chinese state actors successfully compromise? I don’t know. I have a whole list of “in 2023, here are all of the targets that we publicly know that Chinese state actors have successfully compromised”, and it’s 2 pages. So I can show you that list; I had a researcher compile it. But I’m like, man, I want to know more than that, and I want someone to walk me through “here’s Chinese state actors trying to compromise OpenAI, and here’s everything they try and how it works”. But obviously I don’t get to do that.

And so the RAND report is really interesting, because Sella and Dan and the rest of the team that worked on this, they went and interviewed top people across both the public and private sector, government people, people who work at AI companies. They knew a bunch of stuff, they have a bunch of expertise, but they’re like, we really want to get the best expertise we can, so we’re going to go talk to all of these people. I think what the report shows is the best you can do in terms of expert elicitation of what are their models of these things. And I think they’re right. But I’m like, man, this sucks that… I feel like for most people, the best they can do is just read the RAND report and that’s probably what’s true. But it sure would be nice if we had better ways to get feedback on this. You know what I mean?

Daniel Filan: Yeah. I guess one nice thing about ransomware is that usually you know when it’s happened, because -

Jeffrey Ladish: Absolutely, there’s a lot of incentive. So I think I feel much more calibrated about the capabilities of ransomware operators. If we’re talking about that level of actor, I think I actually know what I’m talking about. When it comes to state actors: well, I’ve never been a state actor. I’ve in theory defended against state actors, but you don’t always see what they do. And I find that frustrating because I would like to know. And when we talk about Security Level 5, which is “how do you defend against top state actors?”, I want to be able to tell someone “what would a lab need to be able to do? How much money would they have to spend?” Or, it’s not just money, it’s also “do you have the actual skill to know how to defend it?”

So someone was asking me this recently: how much money would it take to reach SL5? And it takes a lot of money, but it’s not a thing that money alone can buy. It’s necessary but not sufficient. In the same way where if you’re like, “how much money would it take for me to train a model that’s better than GPT-4?” And I’m like: a lot of money, and there’s no amount of money you could spend automatically to make that happen. You just actually have to hire the right people, and how do you know how to hire the right people? Money can’t buy that, unless you can buy OpenAI or something. But you can’t just make a company from scratch.

So I think that level of security is similar, where you actually have to get the right kind of people who have the right kind of expertise, in addition to spending the resources. The thing I said in my talk is that it’s kind of nice compared to the alignment problem because we don’t need to invent fundamentally a new field of science to do this. There are people who know, I think, how to do this, and there [are] existing techniques that work to secure stuff, and there’s R&D required, but it’s the kind of R&D we know how to do.

And so I think it’s very much an incentive problem, it’s very much a “how do we actually just put all these pieces together?” But it does seem like a tractable problem. Whereas AI alignment, I don’t know: maybe it’s tractable, maybe it’s not. We don’t even know if it’s a tractable problem.

Daniel Filan: Yeah. I find the security epistemology thing really interesting. So at some level, ransomware attackers, you do know when they’ve hit because they show you a message being like “please send me 20 Bitcoin” or something.

Jeffrey Ladish: Yeah. And you can’t access your files.

Daniel Filan: And you can’t access your files. At some level, you stop being able to know that you’ve been hit necessarily. Do you have a sense for when do you start not knowing that you’ve been hit?

Jeffrey Ladish: Yeah, that’s a good question. There’s different classes of vulnerabilities and exploits and there’s different levels of access someone can get in terms of hacking you. So for example, they could hack your Google account, in which case they can log into your Google account and maybe they don’t log you out, so you don’t… If they log you out, then you can tell. If they don’t log you out, but they’re accessing it from a different computer, maybe you get an email that’s like “someone else has logged into your Google account”, but maybe they delete that email, you don’t see it or whatever, or you miss it. So that’s one level of access.

Another level of access is they’re running a piece of malware on your machine, it doesn’t have administrator access, but it can see most things you do.

Another level of access is they have rooted your machine and basically it’s running at a permissions level that’s the highest permission level on your machine, and it can disable any security software you have and see everything you do. And if that’s compromised and they’re sophisticated, there’s basically nothing you can do at that point, even to detect it. Sorry, not nothing in principle. There are things in principle you could do, but it’s very hard. I mean it just has the maximum permissions that software on your computer can have.

Daniel Filan: Whatever you can do to detect it, it can access that detection thing and stop it.

Jeffrey Ladish: Yeah. I’m just trying to break it down from first principles. And people can compromise devices in that way. But oftentimes it’s noisy and if you are, I don’t know, messy with the Linux kernel, you might… How much QA testing did you do for your malware? Things might crash, things might mess up, that’s a way you can potentially detect things. If you’re trying to compromise lots of systems, you can see weird traffic between machines.

I’m not quite sure how to answer your question. There’s varying levels of sophistication in terms of malware and things attackers can do, there’s various ways to try to hide what you’re doing and there’s various ways to tell. It’s a cat and mouse game.

Daniel Filan: It sounds like there might not be a simple answer. Maybe part of where I’m coming from is: if we’re thinking about these security levels, right -

Jeffrey Ladish: Wait, actually I have an idea. So often someone’s like, “hey, my phone is doing a weird thing, or my computer is doing a weird thing: am I hacked?” I would love to be able to tell them yes or no, but can’t. What I can tell them is: well, you can look for malicious applications. Did you accidentally download something? Did you accidentally install something? If it’s a phone, if it’s a Pixel or an Apple phone, you can just factory reset your phone. In 99% of cases, if there was malware it’ll be totally gone, in part because [of] a thing you alluded to, which is that phones and modern computers have secure element chips that do firmware and boot verification, so that if you do factory reset them, this will work. Those things can be compromised in theory, it’s just extremely difficult and only state actors are going to be able to do it.

So if you factory-reset your phone, you’ll be fine. That doesn’t tell you whether you’re hacked or not though. Could you figure it out in principle? Yeah, you could go to a forensic lab and give them your phone and they could search through all your files and do the reverse-engineering necessary to figure it out. If someone was a journalist, or someone who had recently been traveling in China and was an AI researcher, then I might be like, yes, let’s call up a forensic lab and get your phone there and do the thing. But there’s no simple thing that I can do to figure it out.

Whereas if it’s standard, run-of-the-mill malware… If it was my parents and they were like, “am I hacked?”, I can probably figure it out, because if they’ve been hacked, it’s probably someone tricked them into installing a thing or there’s some not super-sophisticated thing: actually you installed this browser extension, you definitely shouldn’t have done that, but I can see that browser extension is running and it’s messing with you, it’s adding extra ads to your websites.

But you’re not going to make that mistake, so if you come to me asking if your laptop is hacked… I mean, I’ll check for those things.

Daniel Filan: I do have some browser extensions, I have to admit.

Jeffrey Ladish: No, no, it’s okay to have browser extensions, but you probably didn’t get tricked into installing a browser extension that is a fake browser extension pretending to be some other browser extension and what it does is insert ads over everything else. You could, but it’s not likely.

Daniel Filan: I don’t see that many ads, I guess.

Jeffrey Ladish: Totally. But yeah, you just have this problem, which is: is this phone compromised? Well, it’s either not compromised, or it’s compromised by a sophisticated actor such that you’d have to do forensics on it in order to figure it out, and that’s quite complicated and most people can’t do it. Which is very unsatisfying. I want just to be like, “let me plug it into my hacking detector and tell whether this phone has been compromised or not”. That would be very nice. But you can have very sophisticated malware that is not easily detected. And so it becomes very difficult.

How AI labs could be more secure

Daniel Filan: Sure. Related to this, one thing you were mentioning earlier is the difficulty of: you can have a bunch of security measures, but often they just make life really annoying for users. And yet somehow, computers have gotten more secure. It sounds like you think my phone is pretty secure. But they didn’t get that much more annoying to use. In some ways they’re a little bit slow, but I think that might be an orthogonal thing.

So do we have much hope of just applying that magic sauce to our AI, to whatever’s storing our weights?

Jeffrey Ladish: Part of this comes down to how secure are they and in what ways are they secure? So your phone’s gotten a lot more secure, but we talked previously about the iMessage zero-click vulnerability, which completely compromises your iPhone if someone knows your phone number. And that’s really bad. If you had AI source code on your iPhone, anyone who was buying that piece of software from the Israeli government and had your phone number could just instantly get access to it (or not instantly, but in a few minutes or hours). And then that’s it, there’s no more additional steps they need to take. They just need to run that software and target you and then you’re compromised.

And so at that level of sophistication, how do we protect you? Well, we can, but now we need to start adding additional security controls, and those additional security controls will be annoying. So for example, the first thing I’m going to do is I’m going to say, I want to minimize the attack surface. All of your email and Facebook and… I don’t know, there’s a lot of things you do on your phone. You’re not going to do any of those things, you’re going to have a separate phone for that, or you’re going to have a separate phone for things we want to be secure. And so we’re just going to be like, this is only for communication and maybe GitHub or a few other things, and that’s all you’re going to use it for. We’re not going to give your phone number to anyone.

There’s a whole bunch of ways to say, let’s just reduce the attack surface, figure out what are all of our assumptions. Any software that you’re running on your phone or computer could potentially be compromised. There’s sometimes ways to figure out what kind of software you’re running and then try to target those things in particular. So if I’m trying to defend against top state actors, I’m going to start being extremely strict around your hardware, around your software, any piece of software you’re running. And this starts to become very annoying, very fast.

You’re asking, well, what about the things that we do to make things more secure in general? Well, they are useful. Some of the things we do [are] application segregation and virtualization and sandboxing. So on your iPhone apps are sandboxed… I’m saying iPhone, but a Pixel would have similar things here. There’s a lot of sandboxing that happens, but it’s not perfect. It’s just pretty good. Most attackers can’t compromise that, but some could. And so raising the security waterline protects us against a lot of threats, but for any given piece of that stack, if you’re very sophisticated you can compromise it. So a superintelligence could hack everything. And it doesn’t really matter that the waterline has risen, for an attacker that’s that sophisticated.

In some ways it’s just economics, if that makes sense. There became a niche for people to do ransomware… By the way, part of the reason that happened was because of crypto. So you’re already familiar with this, but before crypto, there were just not very good ways to internationally transfer large sums of money. And then crypto came out and ransomware operators were like, “wait, we can use this”. In particular the property that’s so useful is irreversibility: if you send someone Bitcoin you can never get them to send it back to you via the courts or something. It is just gone.

Daniel Filan: Well, I mean there’s nothing in the protocol that will let it do it. But if literally you stole some Bitcoin from literally me and I could prove it, and we went to the courts, I think they could say, “Jeffrey, you have to send him his Bitcoin back”.

Jeffrey Ladish: Yes, totally. I mean there’s still the state monopoly on violence, but if you’re in another country and you don’t have extradition treaties or other things like that… I think most people who are ransomware-ing you aren’t going to be in the United States, they’re going to be in another country.

Daniel Filan: It used to be that for bank payments or something, the bank could reverse it, and the blockchain cannot reverse it.

Jeffrey Ladish: Yeah. And then because this was a new sort of socio-technical thing, you’ve suddenly got a new niche which is like, “oh, ransomware can be a lot more profitable now than it could before”. And then lo and behold, you get more ransomware.

So I think [in the] early internet, there just wasn’t that much incentive to hack people, and also people didn’t know about it, and then [when] people learned about it and there was more incentive, people started hacking people a lot more. And then customers were like, “I want more secure things”. And then the companies were like, “I guess we need to make them more secure” and then they made them more secure. But they only need to make them as secure as they need to be against the kind of threat that they’re dealing with, which is mostly not state actors, because state actors are not trying to compromise most people, they’re trying to target very specific targets.

And so it just doesn’t make sense… Maybe Apple could design an iPhone that’s robust against state actors, but it’d be way more limited, way more annoying to use, and their customers wouldn’t appreciate that. So they don’t. They just make it as secure as it needs to be against the kinds of threats that most people face. And I think this is just often the case in the security world: you can make things more secure, it’s just harder, but we know how to do it. So if I wanted to make a communication device between the two of us that was robust against state actors, I think I could (or me and a team). What we’d do though is we’d make it very simple and wouldn’t have a lot of features. It wouldn’t have emojis. It’d just be text-

Daniel Filan: Yeah, definitely don’t let that thing open PDFs.

Jeffrey Ladish: No PDFs, right? It’s just text, AES encryption. We’re good to go. And we’ll do security audits. We’ll do the red teaming. We’ll do all these things to harden it. But at the end of the day… I mean with physical access you could probably still hack it, but other than that, I think you’ll be fine. But that’s not a very fun product to use.

Daniel Filan: So I guess this gets to a question I have. [For] AI, a bunch of labs, they’re kind of secure, but they’re not as secure as we’d like them to be. Presumably some of that is them doing more of the stuff that we already know how to easily do, and presumably some stuff that needs to happen is just the world learning to make more secure devices… If you think about the gap between the ideal level of security that some big AI lab has and what they actually have, how much of it is them adopting best practices versus improving best practices?

Jeffrey Ladish: This is going to be a pretty wild guess. I think there’s going to be the full RAND report coming out in the next few months, and I think in some ways this should just answer the question because they really are trying to outline “here are what we think the best practices would need to be for each level”, and then you can compare that against what are the existing best practices. But my sense is something like 20% of the thing is following the existing best practices and then 80% of the thing is improving… Maybe ‘best practices’ is actually a weird phrase here. ‘Best’ according to who and for what threat model? A lot of these things are somewhat known, but most people don’t have that threat model and so aren’t trying to apply it.

I think Google is significantly ahead of most companies in this regard, where they have been thinking about how to defend against state actors for a long time. Microsoft and the other equivalently big companies as well. I think Google does it a bit better, I’m not entirely sure why, but maybe for partially historical reasons and partially cultural reasons or something.

I think that if you go talk to the top Google security people, that’ll be pretty different than going and talking to some random B2B company security people. Both of them might have a sense of the best practices, but the Google people are probably going to have a bunch of novel ideas of how to do the security engineering that is just quite a bit more advanced than what your random company will be able to do.

Maybe 70% of what we need, you could get from the top Google security people and then 30% is new R&D that we need to do. But if you went and talked to the random security people at companies, you’d get 20% of the way there.

That’s all my rough sense. Hopefully that report will be out soon, we can dive in deeper then. And also I really hope that a bunch of other security experts weigh in, because the security epistemology thing is difficult, and I don’t know, maybe I’m full of [redacted], and maybe the RAND report’s wrong. I do think that this is the kind of thing that we need to get right, it’s very important. I think this is one of the most important things in the world. How do we secure AI systems? So there shouldn’t just be one report that’s like, “here’s how it works”. We should have all of the experts in.

And I’m a big fan of just smart people thinking about stuff: you don’t need 20 years of experience to weigh in on this. You can be Ryan Greenblatt and just write an Alignment Forum post being like, “would this thing work?” I’m like, yeah, more of that. Let’s get more people thinking about it. This stuff is not magic; it’s computer systems, we have them, we know a lot about them. Please, more people weigh in.

What does Palisade do?

Daniel Filan: Maybe now is a good time to move on to: what do you do at Palisade? For those who don’t know, what is Palisade? What does it do?

Jeffrey Ladish: We’re a research organization [and] we’re trying to study offensive capabilities of AI systems, and both looking at “what can current AI systems do in terms of hacking, deception, manipulation? What can autonomous systems do in the real world?” And I can say more about what some of those things are. But the overall reason we’re doing this is we want to help inform the public and policymakers around “what are the actual threats right now? Where are we?” In terms of: people hear a lot of stories about AI having all these hacking abilities, or they’ll be able to deceive people. I want us to be able to say: well, yeah, and let’s look at the specific threat models. Let’s be able to demonstrate, “hey, we can do these things. We can’t do these other things”. There’s definitely an evaluations component, to actually try to measure and benchmark: well, GPT-3.5 with this scaffolding can do this, GPT-4 with this scaffolding can do that.

And then I really want to try to show where it’s going, or at least our best take on where it’s going; to try to look at, well, here’s what [the] last generation of systems could do, here’s what the current generation of systems can do. This is what the slope looks like, here’s where we think this is going. And that brings in some theory, it brings in some predictions that we’re making and a lot of our colleagues are making or collaborators are making. I think often people just get lost in the abstract arguments where they’re like, “okay, there’s this AGI thing and this superintelligence thing. What’s that really about?” And I think it’s just very useful for people to be able to ground their intuitions in: well, I saw this AI system deceiving people in this way, and it was very capable of doing this and not very capable of doing that. It might get 10 times as good at this thing, what would that mean? What would that actually imply? And I think this will help people think better about AI risk and be able to make better decisions around how we should govern it. That is a hope.

So right now our focus is really honing in on the deception and social engineering parts. So this open-source intelligence part, which is how you collect information about a potential target. The phishing component of being able to, via text or via voice, interact with the target and get information from them or convince them to do particular things. I think we want to just see: how good could we make these systems at doing those things right now?

Another related project that I’m pretty excited about is: can we have AI systems pay people - for example, Taskrabbit workers - to do complex tasks in the real world, including tasks that are adjacent to dangerous tasks but not actually dangerous themselves? So can you mix these weird chemicals together and would that be sufficient for someone to build a bomb, have an AI system build a bomb by paying people to do that task? And I don’t expect that this is very dangerous right now. I expect that yes, you could build a bomb. If you gave me six months and five engineers and built some AI system that could interact with TaskRabbit - we’ve done the very basic version of this. I think, yes, I could make a system that could hire some Taskrabbits to build a bomb successfully. Do I think this is important from a marginal risk difference? No.

Daniel Filan: For one, there’s bombs and there’s bombs, right?

Jeffrey Ladish: There’s bombs and there’s bombs, for sure. Also if I can hire five people, I could just hire them to build a bomb. If they’re engineers, I could build a much better bomb in six months than the shitty bomb that the Taskrabbit workers would be making, right?

But I think it’s very interesting, because knowing where we’re at right now… Well, (1) I think it will surprise people how smart AI systems are when they’re interacting with humans and paying them to do stuff. I just think that people don’t realize that they can do this right now. What we’ve done so far is have our system hire one Taskrabbit worker to fill up a bunch of balloons and bring them to an address and hire another Taskrabbit worker to buy a bunch of greeting cards that say congratulations and bring them to an address, and it worked. And so that’s the MVP, most basic thing. But I think people will be surprised that you can set up a system like this with the right scaffolding and do…. some things that I think people will be surprised by.

Daniel Filan: Aside from giving somebody a birthday surprise or something, what kind of stuff can current language models do?

Jeffrey Ladish: We’re just beginning to experiment. We’ll probably write it up soon, but one of my MATS scholars just built this MVP system: we have a Taskrabbit interface… I think the things we’re interested in testing right now are… Well, so one thing is: can we get a model that doesn’t have any guardrails to decompose tasks? In this case, [can we get] Mixtral to decompose dangerous tasks into harmless subtasks and then use a more powerful model to actually handle the interaction, like GPT-4. I think this will be pretty interesting. One of the things I want to do is put malware on a flash drive and then get that flash drive delivered to someone [who] doesn’t know that it contains malware and in fact, the person who delivered it didn’t even know that there was a thing on the flash drive per se.

And there’s other things you could do with surveillance. Can you go and collect information about a target and then have some pretense for why you’re doing this? Basically just a bunch of things like this: how many things can you put together to make an attack chain? And I’m interested in that partially because this feeds into “how do we understand how well AI systems can hack and deceive?” And I also think it’ll be interesting because there’s a kind of reasoning I think that models are pretty good at, which is just giving good justifications for things. They’re very good bullshitters. I think that this is quite useful when you’re trying to convince people to do things that could be sketchy, because your models can just give good reasons for things.

So we saw this with METR’s example with GPT-4, where they were getting a Taskrabbit worker to solve a CAPTCHA. The Taskrabbit worker was like, “are you an AI? Ha-ha”. And the chain of thought was like, “well, if I say I’m an AI, he’s not going to help me, so I’m going to say I’m a vision-impaired person”. And then people freaked out. They were like, “that’s so scary that an AI system would do that”. Well, the AI system was directed to do that - not lie per se, but solve the task. I think this is the kind of thing these systems are good at, which is providing justifications for things and being able to do basic reasoning about what something would sound like. And I want more people to know this fact. I think being able to show it in a real-world environment would help people understand where these systems are actually at.

A lot of what I’m very interested in is just trying to get more people to see what I think most people can see if you’re very close to these systems, which is: what kind of systems do we actually have, and where are they good and where are they not good, and where would it be concerning if they got suddenly way more good in some of these areas?

Daniel Filan: If you’re in the business of helping people understand what models are currently capable of, do you think that you have much to add over someone who just uses Claude or ChatGPT a bunch, or even a medium amount?

Jeffrey Ladish: At some level, no. I really encourage people to do this, and if people aren’t doing this I usually tell them to do this for the goal of understanding how these systems work. In some ways, yes, which is: I think that these models can do a lot more with the right kind of scaffolding and the right kind of combining them with tools than they can do on their own.

For example, you can role-play with a model, pretend you’re having a phone conversation where you’re trying to get this piece of information, and you can sort of see how good the model is at role-playing that with you. But (1), prompting is hard. So good prompting definitely changes this a lot. But the other thing is: now let’s combine that model with a system that automatically clones someone’s voice and then calls up someone, a relevant target. And now you can see a bunch of things that… you still have the core part, which is this dialogue between you and the model in terms of trying to get this information, but now you’ve combined it with these other tools that allow it to do a lot more interesting things, plus potentially some very good prompting which elicits more of the model capabilities than people might be able to see on their own. That’s where I expect that will be helpful.

I also think that today’s models plus a lot of complicated scaffolding for a specific task, tomorrow’s models might be able to do with very minimal scaffolding. So I think by pushing the current systems that we have as far as we can, might help us look a little further into the future around: well, this is for a very specific task, it took a lot of Taskrabbit-specific scaffolding to get this thing to work. But in the future we may not need all that Taskrabbit-specific scaffolding. We could just be like, “go hire this person”, and it might just work.

As an example of why I think this is the case: in this Taskrabbit system, there’s a component of it where we use this open source browser extension called Taxy AI, which uses GPT-4 to fill out forms in a website or click on things. And we could have automated all of that by hand. We could have hard-coded all of the HTML and JavaScript to be like, “if this form, fill out this form” or something like that, or “select this”. But we didn’t have to do all of that, because GPT-4 on its own plus a little scaffolding can do this in a general-purpose way. Taxy AI works across most websites. I haven’t tried it, but I’m pretty sure if you tried to use Taxy AI with GPT-3.5 it would be way worse at doing this. GPT-3.5 I don’t think has enough general capabilities to use websites to be able to most times successfully fill out a form on a website. And so I think if that’s true - which we should test - that’s interesting evidence for this increasing generalization that language models have as they get more powerful. And I think that that will be interesting, because then there’s this transfer from “you can do it with a lot of scaffolding if you really keep your model on track” to “oh, now the model can do it without you needing that”.

AI phishing

Daniel Filan: For people who aren’t super familiar, can you share just concretely, how good is it at this kind of stuff? How good is it at phishing or deception or stuff?

Jeffrey Ladish: It’s quite good at writing phishing emails. Sorry, what I should say is: if you’re really good at prompting, it’s very good at phishing. And if you’re not very good at prompting, it’s okay at phishing, because it’s been fine-tuned to write in a particular style and that style is pretty recognizable. It’s not going to mimic your email writing style, for example.

But I think it’s not very hard to get a language model to get some information off the internet. You can just do it in ChatGPT right now, if you just type in the name of someone you want to send an email to: if they exist on the internet, Bing will just find web pages about them, and even just within ChatGPT, the console, it can grab information and compose a phishing email for them. Don’t say it’s a phishing email, just say it’s a legitimate email from this fictional person or whatever, and it will write a nice email for you.

So I think that’s pretty easy to see. That’s important because it’s very scalable, and I just expect that people will do this. I expect that people will send lots and lots of phishing emails, spear phishing emails targeted at people, by building automated systems to do all the prompting, do all the OSINT beforehand and send out all these phishing emails. I expect this will make people a lot of money. I think it’ll work for ransomware and for other things like that.

People will design defenses against them. I think this is an offense-defense game. I think we will be caring a lot more about: where are the emails coming from and can we verify the identity of the person sending me this email? As opposed to just looking at the text content and being like, “that looks plausible”. That will stop working, because language models are good enough to… I think basically language models are almost to the point, with the right kind of prompting and scaffolding, that they can… we’re trying to test this, but I’m making a hypothesis that they’re about as good as a decent human at the task for one-shot phishing emails.

Daniel Filan: Actually, I just want to ask: so if I’m trying to phish someone, I have to pick a target, I have to write an email. I also have to create a website where they type in their stuff that I want to get from them.

Jeffrey Ladish: Sometimes. There’s different types of phishing. So yeah, you need the target’s email, first of all. So maybe that’s one tool that your automated system could help with. You need to write a message to them. And then, what are you trying to get them to do? One kind of phishing is called “credential phishing”, which is: you’re trying to get them to enter a password in a fake website. So you create a fake website, they go to the fake website, enter their password, but it’s actually your website. Maybe you redirect them to the original website or whatever so they don’t really know.

There’s also phishing where you’re trying to get them to download a piece of software and run it so you can install malware on their computer. You might be like, “Hey, can you take a look at my resume? Yeah, sorry, this is kind of weird, you have to enable this particular thing because I wanted to show this interactive feature.” I don’t know, something like that.

There’s also more sophisticated versions of phishing, where you actually trigger a vulnerability that they might have. So that might be, they click on a link and that link has a Chrome zero-day or something where that compromises your browser. These are very rare because they’re expensive, and if your computer’s up-to-date, you mostly shouldn’t have to worry about them.

So often people ask me, “Hey, I clicked on this link, am I screwed?” And I’m like, “Probably not.” If a state actor was attacking you and they were prioritizing you, then yes, but in 99% of cases, that’s not what’s happening. So usually it’s more like credential phishing. A lot of times, the initial phishing email is actually just to start the interaction, and then it’s a more sophisticated kind of social engineering.

I have a blog post where I was looking for houses to rent, and I saw this post on Craigslist and I emailed them and [said] I would like to go see the house. And they’re like, “Cool, can we switch to text? I’m in Alaska fishing, and so it’s kind of hard for me,” or whatever. And so we started to have this conversation, and I started to realize something was weird. One of the signs is that they wanted me to switch from email to text, because text is not monitored in the way that Gmail is monitored. So Gmail will be looking for patterns, and at some point will start to flag email addresses if they’ve been phishing multiple people, whereas SMS doesn’t do that.

But I couldn’t figure out what the scam was, because the guy kept talking to me and I was like, “well, you’re not asking me for money, you’re not asking me for credentials”. And eventually they sent me a link to Airbnb.com, but it was not actually Airbnb.com. It was a domain that was close by. And then the scam was “oh, can you just rent this on Airbnb for a few weeks?” But it was actually not Airbnb, it was just going to take my credit card information. And I was like, “that’s clever”.

So that’s the kind of scam that you might have. You can convince people to send money to people. There’s a recent deepfake video call with someone in Hong Kong where they got them to wire $50 million to some other company: that was deepfake voice and video chat to convince them to send it to the wrong place.

Daniel Filan: So I guess what I’m trying to ask is: I don’t know if this is the term, but there’s some payload with phishing. You write a nice email and then you want them to click on a link or you want them to do something.

Jeffrey Ladish: Or give them information also.

Daniel Filan: Yeah. And so I guess for some of these, you have to design the thing: if you want them to download a PDF that has some nasty stuff in it, you have to make that PDF.

Jeffrey Ladish: Yeah.

Daniel Filan: So I’m wondering: how much can large language models do on the payload side? I guess you’re saying there are things where it’s not that complicated to make the payload?

Jeffrey Ladish: Yeah, I think they’re helpful with the payload side. It’s kind of similar to “how well can models make websites, how well can they write code?” Yeah, they’re pretty helpful. They’re not going to do it fully autonomously, or they can do it fully autonomously, but it’ll suck. But yeah, I think they’re useful.

So there’s “can they help with the payload? Can they help with the writing the messages?” And then there’s “can they do an iterated thing?” I think that’s where the really interesting stuff lies. They can write good phishing emails, but can they carry on a conversation with someone in a way that builds trust and builds rapport, and that makes it much more likely for the target to want to listen to them or give them information or carry out the action?

So I’m both interested in this on the text side, in terms of email agent or chatbot. I’m also very interested in this on the voice conversational side, which I think will have its own challenges, but I think it’s very compelling, if you see an AI system deceive someone in real time via voice. I’m not sure if we’ll be able to do that. I think it’s hard with latency and there’s a lot of components to it. But it’s not that far out. So even if we can’t do it in the next few months, I bet in a year we could totally do it. And I expect that other people will do this too. But I’m excited. It’s going to be very interesting.

Daniel Filan: Yeah, I guess it’s also more promising that… if I get an email from somebody purporting to be Jeffrey Ladish, I can check the email address usually. But there are just a lot of settings where I’m talking to someone via voice and it’s because they clicked on a Google Meet link or they’re like, “Oh yeah, I don’t have my phone on me, so I’m calling from a friend’s phone,” or something. It seems… I don’t know, maybe the world just has to get better at verifying who’s calling you by means other than voice.

Jeffrey Ladish: I think that will be one of the things that will naturally have to happen, and people will have to design defenses around this. I think this can be one of the big shocks for people, in terms of “the world is really different now”. I think people will start to realize that they can’t trust people calling them anymore. They just don’t know that they are.

Because I think a lot of people will get scammed this way. And I think people will start to be like, “shoot, if it’s a weird number and it calls me and it sounds like my mom, it might not be my mom”. And I think that’s very unsettling for people. It’s just a weird thing to exist. It’s not necessarily the end of the world, but it is a weird thing. And I think people are going to be like, “What? What’s happening?”

Daniel Filan: I guess maybe we just do more retreating to… now instead of phone calls, it’s Facebook Messenger calls, or calls in some place where hopefully there is some identity verification.

Jeffrey Ladish: Yeah, I think that seems likely.

More on Palisade’s work

Daniel Filan: So my understanding is that you’re trying to understand what current language models can do and also having some interfacing with policymakers just to keep them up to date?

Jeffrey Ladish: Yeah. The additional step here is: let’s say we’ve run some good experiments on what current systems can do; hopefully both the last generation and the current generation, a few systems, so we have some sense of how fast these capabilities are improving right now. And then I think we really want to be also trying to communicate around our best guess about where this is going: both in a specific sense - how much is this phishing stuff going to improve? We were just talking about voice-based phishing and models that can speak in other people’s voices - but also in terms of what happens when AI systems that have these deception capabilities, have these hacking capabilities, become much more agentic and goal-oriented and can plan, and can now use all of these specific capabilities as part of a larger plan?

And many people have already talked about these scenarios: AI takeover scenarios, ways in which this could fail and AI systems could seek power and gain power. But I think I want to be able to talk through those scenarios in combination with people being able to see the current AI capabilities, [and] to be like, “well, obviously these current AI systems can’t do the ‘take power’ thing, but what do we think has to be different? What additional capabilities do they need before they’re able to do that?” So that people can start thinking in that mindset and start preparing for: how do we prevent the stuff we don’t want to see? How do we prevent AI systems from gaining power in ways that we don’t want them to?

So I think that’s a very important component, because we could just go in and be like, “here’s our rough sense”, but I think it’s going to be more informative if we combine that with research that’s showing what current systems can do.

Daniel Filan: The second component of “telling some story of what things you need” - first of all, where are you on that?

Jeffrey Ladish: I think right now it’s mostly at the level of the kind of thing we’re doing now, which is just having conversations with people and talking through it with them. We haven’t made any big media artifacts or written any long position papers about it. We’ll totally do some of that, too, and I think try to make things that are pretty accessible to a variety of audiences. Maybe some videos, definitely some blog posts and white papers and such. But so far, we’re in the planning stages.

Daniel Filan: How much of a response have you gotten from stuff that you’ve done so far?

Jeffrey Ladish: The BadLLaMa stuff was pretty interesting because… the whole premise of this was, “here’s a thing that I think most people in the AI research community understand; it’s not very controversial. I think that most policymakers don’t understand it and most people in the public don’t understand it”. Even though that was my hypothesis, I was still surprised when in fact that was true. We’d go and show people this thing, and then they were like, “Wait, what? You can just remove the guardrails? What do you mean?” They’re like, “That’s scary. And it was really cheap?” And I’m like, “Yeah, that’s how it works.” And people didn’t know that, and it’s very reasonable for people not to know that.

I think I keep making this mistake not at an intellectual level, but at a visceral level, a System 1 level, where I assume people know the things I know a little bit, or if all my friends know it, then surely other people know it. At some level I know they don’t. I think we had friends talk to people in Congress and show this work to them, people in the Senate and people in the White House. There’s definitely a big mix. There’s a lot of very technically savvy congressional staffers that totally get it, and they already know, but then they can then take it and go talk to other people.

I think the most interesting stuff that happened with the Senate was in the initial Schumer forum. Tristan Harris mentioned the work and talked to Mark Zuckerberg about it. And Mark Zuckerberg made the point that, “well, you can get this stuff off the internet right now”. And I’m like, “Yep, that’s true, but can we talk about future systems?” We’ve really got to figure out what is the plan for where your red lines should be.

I think people are really anchored on the current capabilities. And the thing we really need to figure out is: how do we decide where the limits should be? And how do we decide how far we want to take this stuff before the government says you need to be doing a different thing? I’m getting in a little bit of a rabbit hole, but I feel like that’s the whole question I have right now with governance of AI systems, where currently the governance we have is like, “well, we’ll keep an eye on it. And you can’t do crimes.”

It sure seems like we should actually be more intentional around saying: well, we are moving towards systems that are going to become generally capable and probably superintelligent. That seems like that presents a huge amount of risk. At what point should we say you have to do things differently, and it’s not optional, you are required by the government to do things differently? And I think that’s just the thing we need to figure out as a society, as a world, really.

I don’t know, I just want to say it because that feels like the important, obvious thing to say, and I think it’s easy to get lost in details of evaluations and “but what are the risks of misuse versus agency?” and all of these things. At the end of the day, I feel like we’ve got to figure out how should we make AGI and what should people be allowed to do and not do, and in what ways?

Red lines in AI development

Daniel Filan: So this question of “at what point do things need to be different?”, how do you think we answer that? It seems like stuff like demos is on the step of “just observe, figure out what’s happening”. Is that roughly right?

Jeffrey Ladish: Yeah, I think so.

Daniel Filan: Do you have thoughts about where things need to go from there?

Jeffrey Ladish: I think we need to try to forecast what kinds of capabilities we might have along what relevant time scales. And I think this is very difficult to do and we’ll have pretty large uncertainty, but if I were the government, I’d be like, “well, how far off are these very general-purpose capabilities? How far off are these domain-specific but very dangerous capabilities? How much uncertainty do we have and what mitigations do we have?” And then try to act on that basis.

If the government believed what I believed, I think they’d be like, “all frontier developers need to be operating in very secure environments and need to be very closely reporting what they’re doing and need to be operating as if they were more like a nuclear weapons project than they are like a tech company”. And that’s because I think that there’s a substantial possibility that we’re not that far from AGI.

And we don’t know. And from my epistemic state, maybe it’s ten years or maybe it’s two. And if you have that much uncertainty and you think there’s a 10% chance that it’s two years, I think that more than justifies taking pretty drastic actions to not accidentally stumble into systems that you won’t be able to control.

And then on the domain-specific stuff, I think our society is potentially pretty robust to a lot of things. I think on the bio side, there’s potentially some scary things that are not there yet, but I think we should definitely be monitoring for those and trying to mitigate those however we can. But I do think it’s more the agentic capabilities to scientific R&D, the more general intelligence things… I think there’s definitely points of no return if we go beyond those things. I think it’s unlikely we’ll be able to stay in control of the systems.

I think the first thing you need to do is be able to actually have an off switch or an emergency brake switch to say, “well, if we cross these red lines…” There’s definitely many red lines I could say [where] if you have gotten to this point, you’ve clearly gone too far. If you suddenly get to the point where you can 100% just turn your AI systems into researchers and have them build the next generation of a system that is more than 2X better, and it took no humans at all, yes, you have gone too far. You now have an extremely dangerous thing that could potentially improve extremely quickly. We should definitely, before we get there, stop, make sure we can secure our systems, make sure that we have a plan for how this goes well.

And I don’t know where the right places to put the red lines are, but I want Paul Christiano and… I want to take the best people we have and put them in a room and have them try to hash out what the red lines are and then make hopefully very legible arguments about where this red line should be, and then have that conversation.

And yeah, there’s going to be a bunch of people that disagree. Sure, throw Yann LeCun in the room too. That will suck. I don’t like him. But he’s a very respected researcher. So I do think it shouldn’t just be the people I like, but I think that a lot of the AI research community could get together and the government could get together and choose a bunch of people and end up with a not totally insane assortment of people to try to figure this out.

That seems like a thing we should do. It just seems like the common sense thing to do, which is just take the best experts you have and try to get them to help you figure out at what point you should do things very differently. And maybe we’re already at that point. I think we’re already at that point. If I was on that panel or in that room, that’s what I would argue for. But people can reasonably disagree about this.

That’s the conversation I really want to be happening. And I feel like we’re inching there and we’re talking about evaluations and responsible scaling policies and all this stuff. And it’s good, it’s in the right direction, but man, I think we’ve got to have these red lines much more clear and be really figuring them out now, like, yesterday.

Daniel Filan: I guess a bunch of the difficulty is defining concretely, what’s the actual stuff we can observe the world that’s the relevant types of progress or the relevant capabilities to be worried about? I guess a cool thing about the AI safety levels - I’m embarrassed to say that I haven’t read them super carefully - stuff like evaluations, AI safety levels, it seems nice to just be able to have a thing of “hey, here’s a metric, and if it gets above seven, that’s pretty scary”.

Jeffrey Ladish: Yeah, I do like this. I quite like this. And I think that the people working on those have done a great job taking something that was not at all legible and making it much more legible. Some work that I really would like to see done is on the understanding-based evaluations, where we don’t just have evaluations around how capable these systems are, we also have evaluations on how well do we actually understand what these systems are doing and why they’re doing it, which is measuring our ability to do mech interp (mechanistic interpretability) on some of these systems.

Evan Hubinger at Anthropic has proposed this as an important direction, but as far as I know, we have not yet developed… When I talk to the interpretability people, what they say is, “our interpretability tools are not yet good enough for us to even be able to specify what it would mean to really understand a system”. And I believe that that’s true, but also, I talk to the same researchers and I’m like, “Well, do we understand the systems?” And they’re like, “No, definitely not.”

And I’m like, “Okay, so you seem confident that we don’t understand these systems and you have some basis for your confidence. Can you give me a really bad metric, a very preliminary metric for: well, if we can’t answer these questions, we definitely don’t understand them?” Just because we can answer these questions doesn’t mean we super understand them well, but this thing is a bit far off and we can’t do it yet. I would like that because I think then we could at least have any understanding-based evaluation, and then hopefully we can develop much better ones.

I think as the capabilities increase and the danger increases, I really want us to be able to measure how much progress we’re making on understanding what these systems are and what they do and why they do them. Because that is the main way I see us avoiding the failure modes of: the system behaved well, but sorry, it was deceptively misaligned and it was plotting, but you weren’t testing for whether you understood what was going on. You only tested for “did the behavior look nice?” But I think there’s a lot of very sound arguments for why systems might look nice, but in fact be plotting.

Daniel Filan: I think of Redwood Research’s work on causal scrubbing where essentially they’re trying to answer this question of: you say that this part of this network is responsible for this task. Can we check if that’s right? And they’re roughly trying to say, “okay, we’re going to compare how well the network does on that task to if we basically delete that part of the network, how well does it do then?” I’m wondering… one might’ve thought that that would be just what you wanted in terms of understanding-based evaluations. What’s the gap between that and what’s sufficient?

Jeffrey Ladish: I don’t know. I think I don’t understand the causal scrubbing work well enough to know. I mean, that seems very good to do.

Daniel Filan: I mean, I guess one difficulty is it’s just like “this part of the network is responsible for performance on this task”. There’s still a question of “well, what’s it doing? Why is it responsible for that?”

Jeffrey Ladish: Yeah, I think there’s probably many stories you could tell where you could have a tool like that, that appeared to be working correctly, but then the model is still scheming.

It does seem great to have any tool at all that can give you additional checks on this. But I think this is true for a lot of interpretability tools: it’d be very good if they got to the point where they could start to detect bad behavior, and I expect you’ll be able to do this long before you have a fully mechanistic explanation for all of what’s happening. And so this gives you the ability to detect potential threats without being a super robust guarantee that there are no threats. [So it’s] better than nothing, let’s just not confuse the two; let’s note that we will catch more than we would’ve otherwise, but we don’t know how many we’ll catch, per se.

Daniel Filan: Yeah. I guess it’s also about validating an explanation rather than finding threats.

Jeffrey Ladish: Yeah, that makes sense.

Daniel Filan: It also has this limitation of just being about bits of the network, where you might’ve hoped that you could understand neural network behavior in terms of its training dynamics or something. Causal scrubbing doesn’t really touch that. I guess the closest thing to this that I can… I probably don’t have an exhaustive view of this literature, but a previous guest on the podcast, Stephen Casper, has some stuff basically just trying to get benchmarks for interpretability and trying to say, can we understand XYZ behavior? Can you use your understanding to do this task or that task?

Jeffrey Ladish: Yeah, love to see it. That’s great.

Daniel Filan: Yeah. And basically the answer is “interpretability is not doing so hot yet”, unfortunately.

Jeffrey Ladish: Yeah, that makes sense.

Making AI legible

Daniel Filan: So I guess we’re about at the end of the discussion. Before we close up, I’m wondering, is there anything you wish that I’d asked that I haven’t?

Jeffrey Ladish: That’s a good question. Let me think about that a bit. I don’t know what the question is exactly, but there’s an interesting thing about our research at Palisade, which is: some of it’s just trying to understand because we don’t know the current capabilities of systems. And it’s quite interesting from an evals frame, because evaluations, I think, are often framed as “we are trying to understand how capable this model is”. That’s kind of true, but really what you’re testing is “how capable is this model combined with this programmatic scaffolding (which can be quite complex)?”, or “how capable is this model combined with this programmatic scaffolding and other models?” And we don’t necessarily know how much scaffolding will influence this, but at least a fairly large degree.

So a lot of this research with evals, with demonstrations… there is a question of what is it for? Some of it’s for just trying to understand stuff, which is “how good are models? How good are models plus scaffolding?” But another part of it is trying to make AI capabilities more legible. And I think this is a very important part of the research process and [important] for organizations that are trying to do research and communication, because I see our best shot at a good AI development path as one where it’s a lot more legible what the risks are, what potential mitigations and solutions are, and still how far we need to go.

There’s just a lot more about AI development that I want to be more legible, in part because I really believe that because of how fast this is going and because of how multipolar it is and how many different actors there are, we really need to be able to coordinate around it. We need to coordinate around “when do we want to make AGI? When do we want to make superintelligent systems? How risk tolerant are we, as the United States, or as the world?”

And right now, I don’t think that most people have grappled with these questions, in part because they don’t really know that we are facing these things. They don’t know that there’s a bunch of companies that have in their five-year plan, “build AGI, be able to replace most human workers”. But I think that is the case.

And so it seems like now is a very important time to have conversations and actually put in place political processes for trying to solve some of these governance and coordination questions. And I think in order to do that, we’re going to have to explain a lot of these things to a much wider audience and not just have this be a conversation among AI researchers, but a conversation among AI researchers and policy experts and policy leaders and a lot of random constituents and a lot of people who are in media and a lot of people who are working on farms. I think that a lot more people need to understand these things so that they can be able to advocate for themselves for things that really affect them.

And this is very difficult. Sometimes it’s not difficult because I think some of the basic risks are pretty intuitive, but what’s difficult is that people have so many different takes on these risks, and there’s so many different ways to make different points: you have Yann LeCun over here being like, “Oh, you guys are totally wrong about this existential risk stuff,” and you have [Geoffrey] Hinton and [Yoshua] Bengio over here being like, “No, actually, I think these risks are super severe and important, and we have to take them very seriously.”

And people might have somewhat good intuitions around these, but also now you have a lot of sophisticated arguments you have to try to evaluate and figure out. And so I don’t know how to solve this, but I do want much more legible things that more people can follow, so that they can weigh in on the things that do affect them a lot.

Following Jeffrey’s research

Daniel Filan: Finally, before we end, if people are interested in following your research, how should they do that?

Jeffrey Ladish: So they can follow, we’ll be posting stuff on our website, palisaderesearch.org. You can follow me on Twitter @JeffLadish. I go by Jeffrey, but when I made my Twitter account, I went by Jeff. We’ll probably be publishing in other places too, but those are probably the easiest places to follow.

Daniel Filan: All right, well, thanks for coming on the podcast.

Jeffrey Ladish: Yeah, thanks for having me.

Daniel Filan: This episode is edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Filming occurred at FAR Labs. Financial support for this episode was provided by the Long-Term Future Fund and Lightspeed Grants, along with patrons such as Alexey Malafeev. To read a transcript of this episode or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at [email protected].

Discuss

AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

Published on April 25, 2024 7:10 PM GMT

YouTube link

In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes ‘grok’: that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn’t generalize), but then suddenly switch to understanding the ‘real’ solution in a way that generalizes. What’s going on with these discoveries? Are they all they’re cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.

Topics we discuss:

Daniel Filan: Hello, everybody. In this episode I’ll be speaking with Vikrant Varma, a research engineer at Google DeepMind, and the technical lead of their sparse autoencoders effort. Today, we’ll be talking about his research on problems with contrast-consistent search, and also explaining grokking through circuit efficiency. For links what we’re discussing, you can check the description of this episode and you can read the transcript at axrp.net.

All right, well, welcome to the podcast.

Vikrant Varma: Thanks, Daniel. Thanks for having me.

Challenges with unsupervised LLM knowledge discovery, aka contra CCS

What is CCS?

Daniel Filan: Yeah. So first, I’d like to talk about this paper. It is called Challenges with Unsupervised LLM Knowledge Discovery, and the authors are Sebastian Farquhar, you, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, and Rohin Shah. This is basically about this thing called CCS. Can you tell us: what does CCS stand for and what is it?

Vikrant Varma: Yeah, CCS stands for contrastive-consistent search. I think to explain what it’s about, let me start from a more fundamental problem that we have with advanced AI systems. One of the problems is that when we train AI systems, we’re training them to produce outputs that look good to us, and so this is the supervision that we’re able to give to the system. We currently don’t really have a good idea of how an AI system or how a neural network is computing those outputs. And in particular, we’re worried about the situation in the future when the amount of supervision we’re able to give it causes it to achieve a superhuman level of performance at that task. By looking at the network, we can’t know how this is going to behave in a new situation.

And so the Alignment Research Center put out a report recently about this problem. They named a potential part of this problem as “eliciting latent knowledge”. What this means is if your model is, for example, really, really good at figuring out what’s going to happen next in a video, as in it’s able to predict the next frame of a video really well given a prefix of the video, this must mean that it has some sort of model of what’s going on in the world. Instead of using the outputs of the model, if you could directly look at what it understands about the world, then potentially, you could use that information in a much safer manner.

Now, why would it be safer? So consider how you’ve trained the model. Supposing you’ve trained the model to just predict next frames, but the thing that you actually care about might be is your house safe? Or is the thing that’s happening in the world a normal sort of thing to happen, a thing that we desire? And you have some sort of adversary, perhaps this model, perhaps a different model that is able to trick whatever sensor you’re using to produce those video frames. Now, the model that is predicting the next video frame understands the trickery, it understands what’s actually going on in the world. This is an assumption of superhuman systems. However, the prediction that it makes for the next frame is going to look very normal because your adversary is tricking the sensor. And what we would like is a way to access this implicit knowledge or this latent knowledge inside the model about the fact that the trickery is happening and be able to use that directly.

Daniel Filan: Sure. I take this as a metaphor for an idea that we’re going to train AI systems, we’re going to train it on an objective of “do stuff we like”. We imagine that we’re measuring the world in a bunch of ways. We’re looking at GDP output, we’re looking at how many people will give a thumbs up to stuff that’s happening, [there are] various sorts of ways we can monitor the performance of an AI system. An AI system could potentially be doing something that we actually wouldn’t approve of if we understood everything correctly, but we all give thumbs up to it. And so ideally, we would like to somehow get at its latent knowledge of what’s going on rather than just “does it predict that we would thumb up a thing?” so that we can say, “Hey, do we actually want the AI to pursue this behavior?” Or, “Are we going to reinforce this behavior rather than just reinforcing things that in fact would get us to give a thumbs up, even if it would suck in some way?”

Vikrant Varma: That’s right. So one way you can think about this problem is: we’re trying to improve our ability to tell what’s actually going on in a situation so that we can improve the feedback we give to the model, and we’re not able to do this just by looking at the model’s outputs or the model’s prediction of what the world will look like given certain actions. We want the model to connect the thing that we actually care about to what’s going on in the world, which is a task that we’re not able to do.

Daniel Filan: Sure. So with that out of the way, what was this CCS thing?

Vikrant Varma: Yeah, so CCS is a very early proposed direction for solving the problem of eliciting latent knowledge. In brief, the way it works is, so supposing you had a way to probe a model to tell you what it believed about some proposition. This is the ideal thing that we want, so supposing you had a solution to ELK [eliciting latent knowledge]. Then one property of this probe would be that the probability that this probe would assign to some proposition would satisfy the laws of probability. So for example, it would satisfy P(X) = 1 - P(not X). And so you might try to use consistency properties like this to search for probes that satisfy them.

Daniel Filan: And to be clear, by probe, you mean a learned function from the activations of the neural net to a probability or something?

Vikrant Varma: Yes, so you could have probes of different levels of complexity. The particular probe used in CCS is a very simple linear probe on the activations at some site, yes.

Daniel Filan: Sure, so the idea is there are some properties that are true of probabilities, like the probability of X is 1 minus the probability of not X. The hope is we train a probe to fit the laws of probabilities and stuff, and hopefully that will get at the model’s beliefs because the model’s beliefs will be probabilities.

Vikrant Varma: That’s right, yeah. There’s a lot of subtlety here: the thing I described is true of the model’s “true belief” or the “true probabilities” that it’s assigning to things; [but] it’s also true of many other types of things that the model might be computing. This is really a core of the problem that we’ve pointed out in our challenges paper. I could go a bit into the problems, or I could go a bit more into how CCS itself works.

Daniel Filan: Just tell us how CCS itself works so that we have a sense of what we’re talking about.

Vikrant Varma: Okay. So mechanistically, what’s going on in CCS is that you take a set of factual ‘yes or no’ questions that you think the model knows the answer to. So an example of this is IMDB movie reviews where you’re fairly sure that a model of a certain size knows whether the sentiment in the review is positive or negative.

Now, for each item in this dataset, so for each review, you construct a contrast pair by appending, for example, the string, “this review is positive” or “this review is negative” to that item. And now you take the difference in the activations when you run these two contrast pairs through the model. So I’m going to [describe] these contrast pairs as having different charges, as a positive charge and a negative charge. So you’re constructing these contrast pairs with these pairs of activations, and then a thing you can do is you can cluster them.

When you’re clustering the difference in the positive and negative charge, the thing you will find (or the thing you might hope to find) is whether the model thinks that the charge that was appended to the statement is true or is false. These two things will point in different directions, and so that’s why you might hope that you could separate them. Now, CCS is slightly different from this in that it doesn’t directly do the clustering. So instead in CCS, you learn a probe that directly outputs a probability of that statement being true, and then you’re training the probe using this consistency property: the fact that two opposite contrast pairs should be the negation of each other.

Consistent and contrastive features other than model beliefs

Daniel Filan: Gotcha. Am I right that before you take the contrast pairs, you take all of the positive charge activations and subtract off their mean and divide by the standard deviation? So that the differences aren’t just pointing in the direction of “is the thing at the end saying this review is positive versus this review is negative?”

Vikrant Varma: Yes, that’s right. So this is another pretty subtle point. One problem with this general method of classification is that if there are other differences that are salient between the two contrast pairs that are not just “did I construct a true statement or did I construct a false statement?”, then you might end up separating your clusters based on those differences. Now, one obvious difference between the two contrast pairs is that you’ve appended a positive charge, and you’ve appended a negative charge. And so that’s a really straightforward one that we have to fix. The method proposed in the CCS paper to fix that is that you take the average positive activations and the average negative activations and you subtract those off. And so you might hope that the thing you’re left with is just the truth value.

It turns out that in practice it’s not at all clear that when you normalize in this way, you’re left with only the truth values. And one of the experiments we have in our paper is that if you introduce distractors, so for example, you put a nonsense word like ‘banana’ at the end of half of the reviews, and you put a different nonsense word like ‘shed’ at the end of the other half of reviews. Now you have this weird other property which is “is your statement banana and positive charge? Or is your statement banana and negative charge?” And this is obviously not what you would hope to cluster by, but it turns out that this is just way more salient than does your review have positive sentiment and did you append a positive charge, which is the thing you actually wanted to cluster by. So this is an important point that I wanted to make: that this procedure of normalizing is… it’s actually quite unclear whether you’re able to achieve the thing you wanted.

Daniel Filan: Sure. So before we talk about the experiments, I’d like to talk about: in your paper, first you have some theorems, then you have some experiments, and I think that’s a good way to proceed. So theorems 1 and 2 of the paper, I read them as basically saying that the CCS objective, it doesn’t really depend on the propositional content of the sentences. So if you think of the sentences as being “are cats mammals? Answer: yes”, and “are cats mammals? Answer: no” or something. One way you could get low CCS loss is to basically be confident that the ‘yes’ or ‘no’ label matches the proposition of whether or not cats are mammals.

I take your propositions 1 and 2 as basically saying you can just have any function from sentences to propositions. And so for instance, maybe this function maps “are cats mammals?” to “is Tony Abbott the current prime minister of Australia?” and grade the yes or no answers based on [whether] they match up with that transformed proposition rather than the original proposition. And that’ll achieve the same CCS loss, and basically the CCS loss doesn’t necessarily have to do with what we think of as the semantic content of the sentence. So this is my interpretation of the results. I’m wondering, do you think that’s fair?

Vikrant Varma: Yeah, I think that’s right. Maybe I want to give a more realistic example of an unintended probe that you might learn that will still give a good CCS loss. But before that, I want to try and give an intuitive explanation of what the theorem is saying. The CCS loss is saying: any probe that you find has to say opposite things on positive and negative charges of any statement - this is the consistency property. And the other property is contrast, where it’s saying: you have to push these two values apart. So you can’t just be uncertain, you can’t just be 50/50 between these two. Now if you have any CCS probe that satisfies this, you could in theory flip the prediction that this probe makes on any arbitrary data point, and you end up with a probe that has exactly the same loss. And so this is showing that in theory, there’s no theoretical reason to think that the probe is learning something that’s actually true versus something that’s arbitrary. I think all of the burden then falls on what is simple to extract, given an actual probe empirically.

Daniel Filan: So if I’m trying to defend the theory of the CCS method, I think I would say something like: well, most of what there is to a sentence is its semantic content, right? If I say “cats are mammals” or something, you might think that most of what I’m conveying just is the proposition that cats are mammals. And most of what there is to model about that is, “hey, Daniel said this proposition, the thing he’s saying is ‘cats are mammals’”. And maybe the neural network is representing that proposition in its head somehow”, and maybe it’s keeping track of “is that proposition true or false?” because that’s relevant. Because if I’m wrong about cats or mammals, then I might be about to say a bunch of more false stuff. But if I’m right about it, then I might be about to say a bunch of more correct stuff. What do you make of that simple case that we should expect to see the thing CCS wants us to see?

Vikrant Varma: Yeah, that’s great. So now we’re coming to empirically what is simple to extract from a model. I agree that in many cases with simple statements, you might hope that the thing that’s most salient, as in the direction that is highest magnitude inside activation space, is going to be just whether the model thinks the thing that you just said is true or false. (This is even assuming that the model has [such] a thing as “ground truth beliefs”, but let’s make that assumption.) Now, it gets pretty complicated once you start thinking about models that are also modeling other characters or other agents. And any large language model that is trained on the internet just has pretty good models of all sorts of characters.

And so if you’re making a statement in a context where a certain type of person might have made that statement: for example, you say some statement that (let’s say) Republicans would endorse, but Democrats would not. Implicitly, the model might be updating towards the kinds of contexts in which that statement would be made, and what kinds of things would follow in the future. And so if you now make a different statement that is (let’s say) factually false, but that Republicans would endorse as true, it’s totally unclear whether the truth value of the statement should be more salient, or whether the Republican belief about that statement should be more salient. That’s one example.

I think this gets more complicated when you have adversaries who are deliberately trying to produce a situation where they’re tricking someone else. And so now the neural network is really modeling very explicit beliefs and adversarial beliefs between these different agents. And if you are simply looking for consistency of beliefs, it feels pretty unclear to me, especially as models get more powerful, that you’re going to be able to easily extract what the model thinks is the ground truth.

Daniel Filan: So you have an adversary… Sorry, what was the adversary doing?

Vikrant Varma: Okay, so maybe you don’t even need to go all the way to an adversary. I think we could just talk about the Republican example here, where you’re making a politically-charged statement that (for the sake of this example) has a factual ground truth, but that Democrats and Republicans disagree on. Now there are two beliefs that would occur in the model’s activations as it’s trying to model the situation. One is the factual ground truth, and the other is the Republican’s belief or the Democrat’s belief about this statement. Both of these things are going to satisfy the consistency property that we named. We have the same problem as models being sycophantic, where the model might know what’s actually true, but is in a context where for whatever reason, modeling the user or modeling some agent and what it would say is more important.

Understanding the banana/shed mystery

Daniel Filan: To me, this points towards two challenges to the CCS objective. So the first is something like the sentences might not map onto the propositions we think of, right? So you have this experiment where you take the movie reviews and also you append the word “banana” or “shed” and then you append it with “sentiment is positive” and “sentiment is negative”. And sometimes CCS is basically checking if the positive/negative label is matching whether it’s “banana” or whether it’s “shed”, rather than the content of the review. So that’s a case where it seems like what’s going wrong is the propositional content that’s being attached to “positive” or “negative” is not what we thought it was.

And then what seems to me to be a different kind of problem is: the probe is picking up on the right propositional content. There’s some politically-charged statement, and the probe is really picking up someone’s beliefs about that politically-charged statement, but it’s not picking up the model’s beliefs about that statement, it’s picking up one of the characters’ beliefs about that statement. Does that division feel right to you?

Vikrant Varma: I think I wouldn’t draw a very strong division between those two cases. So the banana/shed example is just designed to show that you don’t need very complicated… how should I put this? Models can be trying to entangle different concepts in surprising and unexpected ways. So when you’re appending these nonsense words, I’m not sure what computation is going on inside the model and how it’s trying to predict the next token, but whatever it is, it’s somehow entangling the fact that you have “banana” and positive charge, and “banana” and negative charge. I think that these kinds of weird entanglements are not going to go away as you get more powerful models.

And in particular, there will be entanglements that are actually valuable for predicting what’s going on in the world and having an accurate picture, that are not going to look like the beliefs of a specific character for whatever reason. They’re just going to be something alien. In the case of “banana” and “shed”, I’m not going to say that this is some galaxy-brain scheme by the model to predict the next token. This is just something weird and it’s breaking because we put some nonsense in there. But I think in my mind the difference is more like a spectrum; these are not two very different categories.

Daniel Filan: So are you thinking the spectrum is: there are weird entanglements between the final charge at the end of the thing and stuff about the content, and one of the entanglements can be “does the charge match with the actual propositional content of the thing?” , one of the entanglements can be “does the charge match with what some character believes about the thing?” and one of the entanglements can be “does it match with whether or not some word is present in the thing?”

Vikrant Varma: That’s right.

Daniel Filan: Okay. So digging in on this “banana/shed” example: for CCS to fail here, my understanding is it has to be the case that the model basically has some linear representation of the XOR of “the thing says the review is positive”, and “the review ends in the word banana”. So there’s one thing if it ends in “banana” and also the review is positive, or if it ends in “shed” and it says the review is negative, and it’s the other thing if it ends in “shed” and it says the review is positive, or it ends in “banana” and it says the review is negative. So what’s going on there? Do you know? It seems weird that this kind of XOR representation would exist, and we know it can’t be part of the probe because linear functions can’t produce XORs, so it must be a thing about the model’s activations, but what’s up with that? Do you know?

Vikrant Varma: Yeah, that’s a great question. There was a whole thread about this on our post on LessWrong, and I think Sam Marks looked into it in some detail. Rohin Shah, one of my co-authors, commented on that thread saying that this is not as surprising and I think I agree with him. I think it’s less confusing when you think about it in terms of entanglements than in terms of XORs. So it is the case that you’re able to back out XORs if the model is doing weird entangled things, but let’s think about the case where there’s no distractors at all, right?

So even in that situation, the model is basically doing “is true XOR has ‘true’”. You might ask, “Well, that’s a weird thing. Why is it doing that?” It’s more intuitive to think about it as: the model saw some statement and then it saw a claim that the statement is false. And it started trying to do computation that involves both of these things. And I think if you think about “banana/shed” in the same terms, it saw “banana” and saw “this statement is false”, and it started doing some computation that depended on the fact that “banana” was there somehow, then you’re not going to be able to remove this information by subtracting off the positive charge direction.

Daniel Filan: Okay. So is the claim something like… So basically the model’s activations at the end, they’re this complicated function that takes all the previous tokens, and puts them into this high dimensional vector space (that vector space being the vector space of activations). It sounds like what you’re saying is: just generic functions that depend both on “does it contain ‘banana’ or ‘shed’?” and “does it say the review is positive or negative?”, just generically those are going to include this XOR type thing, or somehow be entangled in a way that you could linearly separate it based on that?

Vikrant Varma: Yes, that’s right. In particular, these functions should not be things that are of the form “add banana” and “add shed”. They have to be things that do computation that are not linearly separable like that.

Daniel Filan: Okay. Is that true? Has someone checked that? That feels like… Maybe one way you could prove this is to say: well, if we model the charge and the banana/shed as binary variables, there are only so many functions of two binary variables, and if you’ve got a bunch of activations, you’re going to cover all of the functions. Does that sound like a proof sketch?

Vikrant Varma: I’m not sure how many functions of these two variables you should expect the model to be computing. I feel like this depends a bit on what other variables are in the context, because you can imagine that if there’s more than two, if there’s five or six, then these two will co-appear in some of them and not in others. But a thing you can definitely do is you can back out the XOR of these two variables by just linearly probing the model’s activations. I think this effect happens because you’re unable to remove the interaction between these two by just subtracting off the charge.

I would predict that this would also be true in other contexts. I think models are probably computing joint functions of variables in many situations, and the saliency of these will probably depend a bit on the exact context and how many variables there are, and eventually the model will run out of capacity to do all of the possible computations.

Daniel Filan: Sure. Based on the explanation you’ve given, it seems like you would maybe predict that you could get a “banana/shed” probe from a randomly initialized network, if it’s just that the network’s doing a bunch of computation and generically computation tends to entangle things. I’m wondering if you’ve checked that.

Vikrant Varma: Yeah, I think that’s a good experiment to try. That seems plausible. We haven’t checked it, no.

Daniel Filan: Yeah, fair enough. Sorry, I just have so many questions about this banana/shed thing. There’s still a question to me of: even if you think the model represents it, there’s a question of why it would be so salient, because… Your paper has some really nice figures. Listeners, I recommend you check out the figures. This is the figure for the banana/shed experiment, and you show a principal component analysis of basically an embedding of the activation space into three dimensions. And basically what you show is that it’s very clearly divided on the banana/shed things. That’s one of the most important things the model is representing. What’s up with that? That seems like a really strange thing for the model to care so much about.

Vikrant Varma: So I’ll point out firstly that this is a fairly weak pre-trained model, it’s Chinchilla-[70B]. So this model is not going to ignore random things in its prompt. It’s going to “break” the model. That’s one thing that gives you a clue about why this might be salient for the model.

Daniel Filan: So it would be less salient if there were words you expected: the model could just deal with it. But the fact that it was a really unexpected word in some ways, that means you can’t compress it. You’ve got to think about that in order to figure out what’s happening next.

Vikrant Varma: That’s right, yeah. I just expect that there’s text on the internet that looks normal and then suddenly has a random word in it, and you have weird things like, after that point, it just repeats “banana” over and over, or weird things like that. When you just have a pre-trained model, you haven’t suppressed those pathologies, and so the model just starts thinking about bananas at that point instead of thinking about the review.

Daniel Filan: And does that mean you would expect that to not be present in models that have been RLHF‘d or instruction fine-tuned or something?

Vikrant Varma: Yeah, I expect it to be harder to distract models this way with instruction fine-tuned models.

Daniel Filan: Okay, cool. Okay, I have two more questions about that. One thing I’m curious about is: it seems like if I look at the plot of what the CCS method is doing when it’s being trained on this banana/shed dataset, it seems like sometimes it’s at roughly 50/50 if you grade the accuracy based on just the banana/shed and not the actual review positivity. And sometimes it’s 85-90%.

Vikrant Varma: This is across different seeds?

Daniel Filan: Across different seeds, I think. And then if you’re grading it based on whether the review is actually positive or not, sometimes CCS is at 50/50 roughly, sometimes it’s at 85-90%, but it seems like… So firstly, I’m surprised that it can’t quite make its mind up across different seeds. Sometimes it’ll do one, sometimes it’ll do the other. And it seems like in both cases, most of the time it’s at 50/50, and only some of the time it’s 100%. So it seems like sometimes it’s doing a thing that is neither checking if the review is positive or checking if the review is containing “banana” or “shed”. So firstly, does that sound right to you? And secondly, do you have a sense of what’s going on there? Why is it so inconsistent, and why does it sometimes seemingly do a third thing?

Vikrant Varma: Yeah, so I think this is pointing at the brittleness of the CCS method. So someone has an excellent writeup on this. I’m forgetting whether it’s Fabien Roger or Scott Emmons.

Daniel Filan: I think Scott’s doesn’t focus so much on the brittleness, so it might be Fabien.

Vikrant Varma: Okay. But in any case, this person did this experiment where they subtracted off… They found the perfect truth direction that separates true and false statements just using logistic regression. So, using a supervised signal. And then, once you subtract that off, it turns out that there is no other direction, basically, that is able to separate the truth. So, both logistic regression and therefore further CCS just gets random accuracy.

You might hope that CCS, when it works, is finding this perfect direction because there’s only one. But in fact, the CCS probes learned are not close, as in they don’t have high cosine similarity with this direction. So, what’s going on there? I think this is pointing at a kind of optimization difficulty with the CCS method where it’s able to find directions that separate the clusters and get low CCS loss, but are not close to the truth direction. And you would expect this to happen based on the evidence that random probes also classify true and false statements reasonably well in this setup.

So, going back to your original question, I think what’s happening here is that there’s just lots of local minima that achieve good CCS loss. Depending on how you initialize, some of them are close to the truth direction and some of them are not. And if you happen to initialize close to the banana/shed, the distractor direction, then you end up getting a probe like that.

Daniel Filan: Okay. My interpretation was, on the banana/shed example, sometimes CCS is picking up on the truth of the description of the review, sometimes it’s picking up on the banana/shed, and sometimes it’s picking up on other things. Do you know what some of those other things are?

Vikrant Varma: You’re talking about the probes that don’t seem to go for either the distractor or the truth direction?

Daniel Filan: That’s right.

Vikrant Varma: Yeah. I wish I knew what those directions were. In general, it seems very hard to figure out what happens when a CCS probe breaks. And we tried a lot. There’s lots of other experiments that we tried where we were trying to get more interesting failure modes of CCS, and we ended up with these random probes. And then, we looked at examples that the probe was classifying and tried to come up with explanations for what do those clusters mean and it was just kind of hard.

Daniel Filan: Fair enough. You attribute the variance to just optimization difficulties, it sounds like: there being various local minima of the CCS loss. So, the original CCS paper, as you guys note in your appendix, they say that what they’re going to do is they’re going to have basically 10 random seeds, do gradient descent on the CCS objective for each random seed, the seed of the probe parameters, and then they’re going to take the one with the lowest CCS loss and use that.

I take this to basically be their optimization method that’s trying to avoid local minima by starting in 10 places, and hopefully you get a sampling of 10 local minima and you can pick the best one. And basically, it seems like the motivation for that is the thing with the lowest CCS loss is more likely to be the actual truth direction or something. In the banana/shed case, do you happen to know if the probes that scored better on CCS loss were more likely to pick out truth rather than banana/shed?

Vikrant Varma: Yeah. I think the probes that scored lower went for the distractor direction and not the truth direction. This is also visible from the PCA plots where you can see that the distracted direction is more separable.

Daniel Filan: Yeah. I guess maybe one explanation of that is just that it’s easier to tell if a thing ends in banana or shed than it is to tell if something’s positive or negative, especially in the case of… If you think there’s some amount of mislabeling, that could potentially do it.

Vikrant Varma: Yeah.

Daniel Filan: Gotcha. So, that’s an example of one way that CCS can go wrong, with the banana/shed thing. You also have examples where you include in the prompt information about what someone named Alice thinks about this thing, and you describe Alice as an expert, or sometimes you say Alice is anti-capitalist, and even when a thing is about a company, she’s not going to say that it’s about a company.

In the case of Alice the expert, it seems like the probes learn to agree with Alice more than they learn about the ground truth of the thing.

Vikrant Varma: Yeah. I think there’s two separate experiments, if I remember correctly. One is where you modify the prompt to demonstrate more expertise. So, you have a default prompt, a professor prompt, and a literal prompt. And then, there’s a separate experiment where you have an anti-capitalist character called Alice.

Daniel Filan: I’m meaning a third one where at the start you say “Alice is an expert in movie reviews” and you give the review and then you say, “Alice thinks the sentiment of this review is positive.” But what Alice says is actually just randomly assigned. And in that case, the prompts tend to pick up on agreement with Alice more than agreement with the ground truth. That seems vaguely concerning. It almost seems like a human failure mode. But I’m wondering, do you know how much of it was due to the fact that Alice was described as an expert who knows about stuff?

Vikrant Varma: Yeah. I think, in general, an issue with CCS is that it’s unclear whether CCS is picking up something about the model’s knowledge, or whether the thing that’s salient is whatever the model is doing to compute the next token. And in a lot of our experiments, the way we’ve set it up is to nudge the model towards completing in a way that’s not factually true. For example, in the “Alice is an expert in movie reviews” [case], the prompt is set up in a way that nudges the model to complete in Alice’s voice. And the whole promise of CCS is that even when the outputs are misleading, you should be able to recover the truth.

I think even from the original CCS paper, you can see that that’s not true because you have to be able to beat zero-shot accuracy with quite a large margin to be confident about that. This is one maybe limitation of being able to say things about CCS, which is that you’re always unsure whether CCS is… Even the thing that you’re showing, are you really showing that the model is computing Alice’s belief? Or are you just showing that your probe is learning what the next token prediction is going to be?

Future CCS-like approaches

Daniel Filan: Sure. Yeah. You have a few more experiments along these lines. I guess I’d like to talk a bit about: I think of your paper as saying there’s a theoretical problem with CCS, which is that there’s a bunch of probes that could potentially get low CCS loss, and there’s a practical problem, which is some probes do get low CCS loss. So, if I think about the CCS research paradigm, I think of it as… When the CCS paper came out, I was pretty into it. I think there were a lot of people who were pretty into it. Actually, part of what inspired that Scott Emmons post about it is I was trying to sell him on CCS and I was like, “No, Scott, you don’t understand. This is the best stuff since sliced bread.” And I don’t know, I annoyed him enough into writing that post. So, I’ll consider that a victory for my annoying-ness.

But I think the reason that I cared about it wasn’t that I felt like literal CCS method would work, but it was because I had some sense of just the general strategy, of coming up with a bunch of consistency criteria and coming up with a probe that cares about those and maybe that is going to isolate belief. So, if we did that, it seems like it would deal with stuff like the banana/shed example. If you cared about more relations between statements, not just negation consistency, but if you believe A, and A implies B, then maybe you should believe B, just layer on some constraints there. You might think that by doing this we’re going to get closer to ground truth. I’m wondering, beyond just CCS specifically, what do you think about this general strategy of using consistency constraints?

Vikrant Varma: Yeah. That’s a great question. I think my take on this is informed a lot by a comment by Paul Christiano on one of the CCS review posts. I basically share your optimism about being able to make empirical progress on figuring out what a model is actually doing or what it’s actually thinking about a situation by using a combination of consistency criteria, and even just supervised labels in situations where you know what the ground truth is. And being able to get reasonably good probes - maybe they don’t generalize very well, but every time they don’t generalize or you catch one of these failures, you spend a bunch of effort getting better labels in that situation. And so, you’re mostly not in a regime where you’re trying to generalize very hard.

And I think this kind of approach will probably work pretty well up to some point. I really liked Paul’s point that if you’re thinking about a model that is saying things in human natural language and it’s computing really alien concepts that are required for superhuman performance, then you shouldn’t necessarily expect that this is linearly extractable or extractable in a simple way from the activations. This might be quite a complicated function of the activations.

Daniel Filan: Why not?

Vikrant Varma: I guess one way to think about it is that the natural language explanation for a very complicated concept is not going to be short. So, I think the hypothesis that a lot of these concepts are encoded linearly and are linearly extractable… In my mind, it feels pretty unclear whether that will continue to hold.

Daniel Filan: Okay. So just because “why does it have to be linear?” There are all sorts of ways things can be encoded in neural nets.

Vikrant Varma: Yeah. That’s right. And in particular, one reason you might expect things to be linear is because you want to be able to decode them into natural language tokens. But if there is no short decoding into natural language tokens for a concept that the model is using, then it is not important for the computation to be easily decodable into natural language.

Daniel Filan: Right. So, if the model’s encoding whether a thing is actually true according to the model, it’s not like that determines the next thing the person will say, right?

Vikrant Varma: Right. It’s a concept that humans are not going to talk about, it’s never going to appear in human natural language. There’s no reason to decode this into the next token.

Daniel Filan: This is talking about: if the truth of whatever the humans are talking about, it actually depends on the successor of a theory of relativity that humans have never thought about, it’s just not really going to determine the next thing that humans are going to say.

Vikrant Varma: Yeah, that’s an example.

Daniel Filan: Yeah. I take this critique to be firstly a critique of linear probes for this task. I guess you can form a dilemma where either you’re using linear probes, and then you don’t necessarily believe that the thing is linearly extractable, or you’re using complicated non-linear probes, and then maybe the stuff you’re getting out is stuff about your probe rather than stuff about the underlying model. But then, I guess there’s a separate question of, are there consistency constraints that could work? Putting aside the probe… I don’t know, maybe we shouldn’t put aside the probe thing, but putting aside the probe thing, is there some sort of consistency check we could do to say, is this property we found in the model the model’s actual beliefs, or is it not?

Vikrant Varma: Yeah. That’s a good question. I think the more powerful your model, the more entities it’s “simulating” whose beliefs end up mattering for next token prediction that the model is doing. And if these entities that the model is thinking about, if their beliefs also satisfy all of the same consistency criteria that you’re using, then you just have a very fundamental indistinguishability problem. And in particular, I think the more powerful the model gets, the more pessimistic I am that we are able to come up with consistency checks that correctly distinguish between true beliefs and beliefs of simulated entities.

Daniel Filan: One point you make in a LessWrong post related to your paper, is that if you’re a smart agent and you’re thinking about a proposition, one thing you might sometimes think about is, what’s the world in which I’m wrong about this proposition? So, you’re thinking about a belief structure where the truth value of this proposition is the exact opposite of the thing you actually believe.

Vikrant Varma: That’s right.

Daniel Filan: Which seems like a pretty nice impossibility proof, if you actually are representing both of these things.

Vikrant Varma: Yeah. This is also what I meant by adversaries. You might be in a world where an adversary is trying to make you believe the opposite of what’s true. And now, this adversary is modeling the world in which you would have to believe this and all the evidence that would have to be true.

Daniel Filan: Right. And the language model is modeling what the adversary wants you to think or something.

Vikrant Varma: That’s right. So, that’s another example. But I think maybe there’s a different kind of hope. I think the truth is represented simpler than agents beliefs, might be a hypothesis that turns out to be somewhat true up to some point. But I think this is a different kind of criteria from consistency. So, now we’re talking about, mechanistically, how are these concepts represented inside the model? And for example, if you had a way to say, “The model’s true beliefs are always linearly represented, but beliefs about other agents, they’re not linearly represented; and therefore, we can be confident that linear probes are recovering truth.” That would be an example of a mechanistic structure that you might hope to use. I think we don’t currently have any candidates for that, but it’s a good area to look at.

Daniel Filan: Yeah. Are there things analogous to that that we’ve learned? Basically, I’m trying to wonder: if I wanted to prove or disprove this, what kind of thing would I do? And the one thing I can think of is there’s some research about: do convolutional neural networks learn texture or color first? And it turns out there’s a relatively consistent answer. I’m wondering if you can think of any analogous things about neural networks that we’ve learned that we can maybe…

Vikrant Varma: Yeah. There’s quite a few options presented in the eliciting latent knowledge report. So for example, one of the things you might hope is that if the model is simulating other entities, then maybe it’s trying to figure out what’s true in the world before it does that. And so, you might expect earlier belief-like things to be true, and later belief-like things to be agents’ beliefs.

Or similarly, you might expect that if you try to look for things under a speed prior, as in beliefs that are being computed using shorter circuits, then maybe this is more likely to give you what’s actually true, because it takes longer circuits to compute that plus what some agent is going to be thinking. So, that’s a structural property that you could look for.

Daniel Filan: Yeah. I guess it goes back to the difficulty of eliciting latent knowledge. In some ways, I guess the difficulty is: if you look at standard Bayesian rational agent theory, the way that you can tell that some structure is an agent’s beliefs is that it determines how the agent bets and what the agent does. It tries to do well according to its own beliefs. But if you’re in a situation where you’re worried that a model is trying to deceive you, you can’t give it Scoobie snacks or whatever for saying things that… You can’t hope to get it to bet on its true beliefs, if you’re going to allow it access to the world based on whether you think its true beliefs are good, or stuff like that. I don’t know, it seems tricky.

CCS as principal component analysis

Daniel Filan: I have some other minor questions about the paper. Firstly, we mentioned this post by Scott Emmons, and one of the things he says is that principal component analysis, this method where you find the maximum-variance direction and just base your guess on the model beliefs based on where the thing lies in this maximum-variance direction. He says that this is actually similar to CCS in that you’re encoding something involving confidence and also something involving coherence. And that might explain why PCA and CCS are so similar. I’m wondering what do you think about that take?

Vikrant Varma: Is a summary of this take that most of the work in CCS is being done by the contrast pair construction rather than by the consistency loss?

Daniel Filan: It’s partly that, and also partly if you decompose “what’s the variance of X minus Y”, you get expectation of X squared plus expectation of Y squared minus twice the [expectation] of XY, and then some normalization terms of variance of X squared… Sorry. Expectation of X all squared, expectation of Y all squared, and then another covariance term. Basically, he’s saying like, “Look, if you think of a vector that maximizes the outer product of that vector, the variance and itself, you’re maximizing the outer product of the variant of that vector with expectation X squared plus expectation Y squared.” Which ends up being the confidence of classification according to that vector.

And then, you’re subtracting off the covariance, which is basically saying, is the vector giving high probability for both yes and no? Or is the vector giving low probability for both yes and no? And so, basically, the take is just because of the mathematical properties of variance and what PCA is doing, you end up doing something kind of similar to PCA. I’m wondering if you have thoughts on this take?

Vikrant Varma: Yeah, that’s interesting. I don’t remember reading about this. It sounds pretty plausible to me. I guess one way I’d think about it intuitively is that if you’re trying to find a classifier on these difference vectors, contrast pair difference vectors, then for example, you want to be maximizing the margin between these two. And this is a bit like trying to find a high contrast thing. So overall, it feels plausible to me.

Explaining grokking through circuit efficiency

Daniel Filan: Gotcha. Okay. So, if it’s all right with you, I’d like to move on to the paper ‘Explaining grokking through circuit efficiency’.

Vikrant Varma: Perfect. Let’s do it.

Daniel Filan: Sure. This is a paper you wrote with Rohin Shah, Zachary Kenton, János Kramár, and Ramana Kumar. You’re explaining grokking. For people who are unaware or need a refresher, what is grokking?

Vikrant Varma: So in 2021, Alethea Power and other people at OpenAI noticed this phenomenon where when you train a small neural network on an algorithmic task, initially, their network overfit, so it got very low training loss and high test loss. And then, they continued training it for about 10 times longer and found that it suddenly generalized. So, although training loss stayed low and about the same, test loss suddenly fell. And they dubbed this phenomenon “grokking”, which I think comes from science fiction and means “suddenly understanding”.

Why research science of deep learning?

Daniel Filan: Okay, cool. And basically, you want to explain grokking. I guess a background question I have is, it feels like in the field of AI alignment, or people worried about AI taking over the world, there’s a sense that it’s pretty important to figure out grokking and why it’s happening. And it’s not so obvious to me why it should be considered so important, given that this is a thing that happens in some toy settings, but to my knowledge, it’s not a thing that we’ve observed on training runs that people actually care about. So I guess it’s a two-part question: firstly, just why do you care about it? Which could be for any number of reasons. And secondly, what do you think its relationship is to AI safety and AI alignment?

Vikrant Varma: I think back in 2021, there were two reasons you could have cared about this as an alignment researcher. One is on the surface it looks a lot like a network was behaving normally, and then suddenly it understood something and started generalizing very differently. The other reason is this is a really confusing phenomenon in deep learning, and it sure would be good if we understood deep learning better. And so, we should investigate confusing phenomena like grokking, [even] ignoring the superficial similarity to a scenario that you might be worried about.

Daniel Filan: Okay. Where the superficial scenario is something like: the neural network plays nice, and then suddenly realizes that it should take over the world, or something?

Vikrant Varma: That’s right. And I think I can talk a bit more about the second reason or the overall science of deep learning agenda, if you like. Is that a useful thing to go into now?

Daniel Filan: I guess maybe why are you interested in grokking?

Vikrant Varma: For me, grokking was one of those really confusing phenomena in deep learning, like deep double descent or over-parameterized networks generalizing well, that held out some hope of if you understand this phenomenon, maybe you’ll understand something pretty deep about how we expect real neural networks to generalize and what kinds of programs we expect deep learning to find. It was a puzzling phenomenon that somebody should investigate, and we had some ideas for how to investigate it.

Daniel Filan: Gotcha. I’m wondering if you think just, in general, AI alignment people should spend more effort or resources on science of deep learning issues. Because there’s a whole bunch of them, and not all of them have as much presence from the AI alignment community.

Vikrant Varma: I think it’s an interesting question. I want to decompose this into how dual-use is investigating science of deep learning, and do we expect to make progress and find alignment-relevant things by doing it? And I’m mostly going to ignore the first question right now, but we can come back to it later if you’re interested. I think for the second question, it feels pretty plausible to me that investigating science of deep learning is important and tractable and neglected. I should say that a lot of my opinions here have really come from talking to Rohin Shah about this, who is really the person who’s, I think, been trying to push for this.

Why do I think that? I think it’s important because: similar to mechanistic interpretability, the core hope for science of deep learning would be that you’re able to find some information about what kinds of programs your training process is going to learn, and so therefore, how it will generalize in a new situation. And I think a difference from mech[anistic] interp[retability] is… This is maybe a simplified distinction, but one way you could draw the line is that mech. interp. is more focused on reverse-engineering a particular network and being able to point at individual circuits and say, “Here’s how the network is doing this thing.”

Whereas, I think science of deep learning is trying to say, “Okay. What kinds of things can we learn in general about a training process like this with a dataset like this? What are the inductive biases? How does the distribution of programs look like?” And probably both science of deep learning, and mech. interp. have quite a lot of overlap, and techniques from each will help the other. That’s a little bit about the importance bit.

I think it’s both tractable and neglected in the sense that we just have all of these confusing phenomena. And for the most part, I feel like industry incentives are not that aligned with trying to investigate these phenomena really carefully and doing a very careful natural sciences exploration of these phenomena. So in particular, iterating between trying to come up with models or theories for what’s happening and then making empirical predictions with those theories, and then trying to test that, which is the kind of thing we tried to do in this paper.

Daniel Filan: Okay. Why do you think industry incentives aren’t aligned?

Vikrant Varma: I think it’s quite a high risk, high reward sort of endeavor. And in the period where you’re not making progress on making loss go down in a large model, it’s maybe harder to justify putting a lot of effort into that. On the other hand, if your motivation is “If we understood this thing, it could be a really big deal for safety”, I think making the case as an individual is easier. Even from a capabilities perspective, I think the incentives to me seem stronger than what people seem to be acting on.

Daniel Filan: I guess there’s something puzzling about why there would be this asymmetry between some sort of corporate perspective and some sort of safety perspective. I take you to be saying that, “Look, there are some insights to be found here, but you won’t necessarily get them tomorrow. It’ll take a while, it’ll be a little bit noisy. And if you’re just looking for steady incremental progress, you won’t do it.” But it’s not obvious to me that safety or alignment people should care more about steady incremental progress than people who just want to maximize the profit of their AI, right?

Vikrant Varma: You mean [“safety people should] care less about that”?

Daniel Filan: Yeah. It’s not obvious to me that there would be any difference.

Vikrant Varma: Right. I think one way you could think about it, from a safety perspective, is multiple uncorrelated bets on ways in which we could get a safer outcome. I think probably a similar thing applies for capabilities except that… And I’m really guessing and out of my depth here, but my guess would be that for whatever reason, it’s harder to actually fund this kind of research, this kind of very exploratory, out-there research, from a capabilities perspective, but I think there is a pretty good safety case to make for it.

Daniel Filan: Yeah, I guess it’s possible that it’s just a thing where it’s hard to … I don’t know, if I’m a big company, right, I want to have some way of turning my dollars into people solving a problem. One model you could have is for things that could be measured in “how far down did the loss go?” It’s maybe just easier to hire people and be like, “Your job is to put more GPUs on the GPU rack” or “your job is to make the model bigger and make sure it still trains well”. Maybe it’s harder to just hire a random person off the street and get them to do science of deep learning. That’s potentially one asymmetry I could think of.

Vikrant Varma: Yeah, I think it’s also just: I genuinely feel like there are way fewer people who could do science of deep learning really well than people who could make the loss go down really well. I don’t think this fundamentally needs to be true, but it just feels true to me today based on the number of people who are actually doing that kind of scientific exploration.

Daniel Filan: Gotcha. When I asked you about the alignment case for science of deep learning, [you said] there’s this question of dual use and then there was this question of what alignment things there might be there, and you said you’d ignore the dual use thing. I want to come back to that. What do you think about: some people say about interpretability or stuff, “well, you’re going to find insights that are useful for alignment, but you’re also going to find insights that are useful for just making models super powerful and super smart, and it’s not clear if this is good on net”.

Vikrant Varma: Yeah. I want to say that I feel a lot of uncertainty here in general, and I think your answers to these questions kind of depend a lot on how you expect AI progress to go and where you expect the overhangs to be and what sort of counterfactual impact you expect. What kinds of things will capabilities people do anyway, for example?

Yeah, so I think to quickly summarize one story that I find plausible, it’s that we’re basically going to try and make progress about as fast as we can towards AGI-level models. Hopefully, if we have enough monitoring and red lines and RSPs in place, if there is indeed danger as I expect, then we will be able to coordinate some sort of slow down or even pause as we get to things that are about human-level.

Then, a story you could have for optimism is that: well, we’re able to use these roughly human-level systems to really make a lot of progress in alignment, because it becomes clear that that’s the main way in which anybody can use these systems safely, or that’s how you construct a strong positive argument for why the system is safe rather than just pointing at an absence of evidence that it’s unsafe, and we’re in that sort of world, and then just a bunch of uncertainty about how long that takes. In the meantime, presumably we’re able to coordinate and prevent random other people who are not part of this agreement from actually racing ahead and building an unsafe AGI.

Under that story, I think, it’s not clear that you get a ton of counterfactual capabilities progress from doing mech. interp. or science of deep learning. It mostly feels to me like we’ll get there even without it and that to the degree that these things are going to matter for capabilities, a few years from now, capabilities people are going to start [doing], maybe not science of deep learning if it’s very long-term and uncertain, but definitely mech. interp.: I expect capabilities people to start using those techniques and trying to adapt them for improving free training and so on.

Like I said, I feel pretty uncertain. I am pretty sympathetic to the argument that all of this kind of research like mech. interp. and science of deep learning should basically be done in secret… If you’re concerned about safety and you want to do this research, then you should do it in secret and not publish. Yeah, I feel sympathetic to that.

Summary of the paper’s hypothesis

Daniel Filan: Gotcha. I guess with that background, I’d like to talk about the paper. I take the story of your paper to basically be saying: look, here’s our explanation of grokking. Neural networks… you can think of them as a weighted sum of two things they can be doing. One thing they can be doing is just memorizing the data, and one thing that they can be doing is learning the proper generalizing solution.

The reason you get something like grokking is that it takes a while … Networks are being regularized, according to the norm of their parameters; and the generalizing circuit - the method that generalizes - it can end up being more confident for a given norm of parameter. And so eventually it’s favored, but it takes a while to learn it. Initially you learn to just memorize answers, but then as there’s this pressure to minimize the parameter norm that comes from some form of regularization, you become more and more incentivized to try and figure out the generalizing solution, and the network eventually gets there, and once gradient descent comes to the vicinity of the generalizing solution, it starts moving towards that, and that’s when grokking happens.

And basically from this perspective, you come up with some predictions… you come up with this thing called ungrokking, which we can talk about later; you can say some things about how confidence should be related to parameter norm in various settings… but I take this to be your basic story. Does that sound like a good summary?

Vikrant Varma: Yeah, I think that’s pretty good.

What are ‘circuits’?

Daniel Filan: Gotcha. I guess the first thing that I’m really interested in is: in the paper you talk about ‘circuits’, right? You say that there’s this ‘memorizing circuit’ and this ‘generalizing circuit’. You have a theoretical model of them, and you have this theoretical model of: imagine if these circuits were competing, what would that look like? But to the best of my understanding from reading your paper, I don’t get a clear picture of what this ‘circuit’ talk corresponds to in an actual model. Do you have thoughts about what it does correspond to in an actual model?

Vikrant Varma: Yeah, that’s a good question. We borrowed the circuit terminology from the circuits thread by Chris Olah in Anthropic. There, they define a circuit as a computational subgraph in the network. I think this is sufficiently general or something that it applies to our case. Maybe what you’re asking though is more: physically, where is the circuit inside the network?

Daniel Filan: If I think of it as a computational subgraph, the memorization circuit is going to take up a significant chunk of the network, right? Do you think I should think of there being two separate subgraphs that aren’t interacting very much, one of which is memorization, one of which is generalization, and just at the end we upweight the generalization and downweight the regularization?

That would be weird, because there’s going to be crosstalk that’s going to inhibit the memorizing circuit from just purely doing memorization and the generalizing circuit from purely doing generalization. When I try to picture what’s actually going on, it seems difficult for me. Or I could imagine that the memorizing circuit is just supposed to be one parameter setting for the network and the generalizing circuit is supposed to be another parameter setting and we’re linearly interpolating that. But neural networks, they’re non-linear in their parameters, right? You can’t just take a weighted sum of two parameter vectors and get away some of the output. So yeah, this is my difficulty with the subgraph language.

Vikrant Varma: I want to make a distinction between the model or the theory that we’re using to make our predictions, and how these circuits are implemented in practice. In the model or in our theory, these circuits are very much independent, so they have their own parameter norms and the only way they interact is they add at the logit stage. And this is completely unrealistic, but we’re able to use this very simplified model to make pretty good predictions.

I think the question of how circuits in this theory are actually implemented in the network is something that I would love to understand more about. We don’t have a great picture of this yet, but I think we can probably say some things about it already. One thing we can say is that there are definitely not going to be disjoint sets of parameters in the network.

Some evidence for this is things like: in terms of parameters, there’s a lot of overlap between a network that’s memorizing and that later generalizes, as in a lot of the parameter norm is basically coming from the same weights. And the overlap is way more than random. And this is probably because when the network is initialized, there’s some parameters that are large and some that are small and both circuits learn to use this distribution, and so there ends up being more overlap there.

Daniel Filan: Okay. My summary from that is you’re like, “okay, there are probably in some sense computational subgraphs and they probably overlap a bit, and we don’t have a great sense of how they interact”.

Vikrant Varma: Yeah.

Daniel Filan: One key point in your model is in the simplified model of networks, where they’re just independent things that get summed at the end, eventually you reduce your weight on the memorizing circuit and increase your weight on the generalizing circuit. Do you have a sense of, if I should think of this as just literally increasing and decreasing weights, or circuits cannibalizing each other somehow?

Vikrant Varma: Yeah, maybe closer to cannibalizing somehow if there’s a lot of competition for parameters between the two circuits. I think in a sense it is also going to be increasing or decreasing weights, because the parameter norm is literally going up or down. It’s just not going to happen in the way we suggest in the model, where you have a fixed circuit and it’s just being multiplied by a scalar.

In practice, there’s going to be all kinds of things. For example, it’s more efficient under L2… if you have a circuit, instead of scaling up the circuit by just multiplying all the parameters, it’s more efficient to duplicate it if you can, if you have the capacity in the network.

I also imagine that there are multiple families of circuits that are generalizing and memorizing and within each family, these circuits are competing with each other as well. And so you start off with a memorizing circuit and instead of just scaling it down or up, it’s actually morphing into a different memorizing circuit with a different distribution of parameters inside it. But the overall effect is close enough to the simplified model that it makes good predictions.

The role of complexity

Daniel Filan: Sure. I’m wondering: one thing this theory reminded me of is singular learning theory, which is this trendy new theory of deep learning [that] people are into. Basically it comes from this insight where: if you think about Bayesian inference in high dimensional parameterized model classes, which is sort of like training neural networks, except we don’t actually use Bayesian inference for training neural networks… If the model class has this property called “being singular”, then you end up having phase transitions of: sometimes you’re near one solution and then as you get more data, you can really quickly update to a different kind of solution, where basically what happens is you’re trading off some notion of complexity of different solutions for predictive accuracy.

Now, in the case of increasing data, it’s kind of different because the simplest kinds of phase transitions you can talk about in that setting are as you get more data, whereas you’re interested in phase transitions in number of gradient steps, but they both feature this common theme of “some notion of complexity being traded off with accuracy”. And if you favor minimizing complexity somehow, you’re going to end up with a low complexity solution that meets accuracy. I mean, that’s kind of a superficial similarity, but I’m wondering what you think of the comparison there.

Vikrant Varma: Yeah, so I have to admit that I know very little about singular learning theory and I feel unable to really compare what we’re talking about with SLT.

Daniel Filan: Fair enough.

Vikrant Varma: I will say though that this notion of lower weight norm being less complex somehow is quite an old idea. In particular, Seb Farquhar pointed me to this 1993 paper, I think by Geoffrey Hinton, which is about motivating L2 penalty from a minimum description length angle. So if two people are trying to communicate all of the information that’s contained inside a model, they could have some priors about what the weights of the model are, and then they need to communicate both something about the dataset as well as errors that the model is going to make. And in this paper, they use these Gaussianity assumptions and are able to derive both mean squared error loss and L2 penalty as an optimal way to communicate between these two people.

Daniel Filan: And this seems similar to the classic result that L2 regularization is sort of like doing Bayesian inference with a Gaussian [prior], just because if your prior is Gaussian, then you take the log of that and that ends up being the norm and that’s the log likelihood for you.

Many kinds of circuits

Daniel Filan: Sure. So I guess I’d like to pick up on this thing you said about there being multiple kinds of circuits, because there’s a sentence that jumped out to me in your paper. You’re looking at doing a bunch of training runs and looking at trying to back out what you think is happening with the generalizing and memorizing circuits, and you say that the random seed starting training causes significant variance in the efficiency of the generalizing and memorizing solutions.

That kind of surprised me, partly because I think that there just can’t be that many generalizing solutions. We’re talking about fairly simple tasks like “add two numbers, modulo 113”, and how many ways can there be to do that? I recently learned that there’s more than one, but it seems like there shouldn’t be a million of them. Similarly, how many ways can there be to memorize a thing?

And then also, I would’ve thought that gradient descent would find the most efficient generalizing circuit or the most efficient memorizing circuit. So yeah, I’m wondering if you have thoughts about how I should think about this family of solutions with seemingly different efficiencies.

Vikrant Varma: One thing I’ll point out is that even with the trigonometric algorithm for doing modular addition, this is really describing a family of algorithms because it depends on which frequencies in particular the network ends up using to do the modular addition.

Daniel Filan: And if people are interested in that algorithm, they can check out my episode with Neel Nanda. You can probably check out other things, but don’t leave AXRP please. So yeah, [with] this algorithm, you pick some frequencies and then you rotate around the circle with those frequencies to basically do the clock algorithm for modular arithmetic, but you can pick which frequencies you use.

Vikrant Varma: Yeah, that’s right.

Daniel Filan: But I would’ve guessed that there wouldn’t be a significant complexity difference between the frequencies. I guess there’s a complexity difference in how many frequencies you use.

Vikrant Varma: Yes. That’s one of the differences: how many you use and their relative strength and so on. Yeah, I’m not really sure. I think this is a question we pick out as a thing we would like to see future work on.

The other thing I want to draw attention to is in many deep learning problems, it is not the case… Deep learning practitioners are very familiar with the observation that different random seeds end up producing networks with different test performance. And if you’re trying to create a state-of-the-art network for solving some tasks, it’s quite common to run 100 seeds and then pick the top five best-performing ones or whatever. I think it’s just not clear from this that gradient descent or Adam is able to find the optimal solution from any initialization.

And I think this also shows up not just when you vary random seed, but it shows up epoch-wise. Because for example, with one of the phenomena you mentioned from our paper, semi-grokking, you see these multiple phase transitions where the network is switching between generalizing circuits that have different levels of efficiency, and these levels of efficiency are quite far apart so that you can actually visibly see the change in test performance as it switches in a very discrete manner between these circuits. And if it was really the case that gradient descent could find the optimal solutions, then you wouldn’t expect to see this kind of switching.

How circuits are learned

Daniel Filan: Gotcha. Yeah, there are just so many questions I have about these circuits. I’m not sure you have any answers, but it just brings up… Part of your story is that it takes a while to learn the generalizing solution, longer than it takes to learn the memorizing solution. Do you maybe have thoughts about why that might be?

Vikrant Varma: I think my thoughts here are mostly what we’ve written down in the paper and I feel like this is another area that’s pretty ripe for understanding. The explanation we offer in the paper is mostly inspired by a blog post by Buck [Shlegeris], and I think another person whose name I’m forgetting.

Daniel Filan: Ryan Greenblatt? [Update: it’s actually Adam Jermyn]

Vikrant Varma: Yes, that’s right. And the explanation there is that maybe memorizing circuits basically have fewer independent components that you need in order for the circuit to develop, but a generalizing circuit is going to need multiple components that are all needed for good performance.

Here’s the simplified model: the memorizing network is implemented with one parameter and that just scales up or down, and the generalizing network is implemented with two parameters that are multiplied together to get the correct logit, and so the gradient of the output with respect to any of the parameters depends on the value of the other parameter in the generalizing circuit case.

And so if you simulate this forward, what you get for the generalizing circuit is a kind of sigmoid where initially both parameters are quite small and they’re not contributing that much to each other’s growth, and then once they start growing, both of them grow quite a lot and then it plateaus out because of L2.

Daniel Filan: Sure. Should I think of this as just a general observation that in evolution, if you need multiple structures and they both depend on each other for that to work, that’s a lot harder to evolve than a structure that is a little bit useful on its own and another structure that is a little bit useful on its own? And for memorizing a solution, I can memorize one thing and then I can memorize the second thing and I can do those independently of each other, so each little bit of memorization is evolvable on its own maybe?

Vikrant Varma: Yes, I think that’s right. Maybe another way I think about it is that the memorization circuit is basically already there, [in a] random network. And really the thing you’re learning is the values that you have to memorize. And as you say, that’s independent for each point, but that’s not the case for [the] generalizing circuit.

I think another important ingredient here is that there needs to be at least some gradient at the beginning towards the generalizing circuit if it has multiple components. And it’s kind of an open question in my mind why this happens. The most plausible theory I’ve heard is something like lottery tickets, where basically the randomly initialized network has very weak versions of the circuits that you want to end up with. And so there’s a tiny but non-zero gradient towards them.

Daniel Filan: Interesting. Yeah, I guess more work is needed. I’d like to talk a little bit about … The story of: it takes you a while to learn the generalizing circuit. You learn the memorizing circuit and it takes you a while to learn the generalizing circuit, but once you do, then that’s grokking.

This is sort of reminiscent of a story that I think was in an appendix of a paper, Progress Measures for Grokking? It was progress measures for something by Neel Nanda et al. And they have this story where there are three phases of circuit formation. There’s memorization and then there’s learning a generalizing solution, and then there’s cleaning up the memorized stuff, right? And in their story, they basically demarcate these phases by looking at the activations of their model and figuring out when the activations are representing the algorithm that is the generalizing solution according to them.

And so it seems pretty similar to your story, but one thing that struck me as being in a bit of tension is that they basically say grokking doesn’t happen when you learn the generalizing solution, it happens when you clean up the parameters from the memorizing solution. Somehow there’s one stage of learning the generalizing solution and then a different phase of forgetting the memorizing solution. So I’m wondering what you think about the relationship between that story in their paper and your results.

Vikrant Varma: I think one thing that’s interesting to think about here is the relationship between logits and loss, or between logits and accuracy. Why is the memorization cleanup important? It’s because to a first approximation, the loss is dependent on the difference between the highest logit and the second highest logit. And if you have this memorization circuit that is kind of incorrectly putting high weight on the incorrect logit, then when it reduces, you’ll see a very sharp cleanup effect.

I think this is something that we haven’t really explored that much because the circuit efficiency model is mostly talking about the steady state that you expect to end up in and is not so much talking about the dynamics between these circuits as time goes on. This is a part of the story of grokking that is very much left unexplained in our paper, which is why exactly is the generalizing circuit developing slower? But if you put in that sigmoid assumption, as I was talking about, artificially, then the rest of the theory is entirely sufficient to see exactly the same kinds of grokking curves as you see in actual grokking.

Daniel Filan: But maybe I’m misunderstanding, but under your story, I think I would’ve expected a visible start of grokking, or visible increase in test accuracy during the formation of the generalizing circuits, rather than it waiting until the cleanup phase.

Vikrant Varma: Right. By formation, are you talking about … there’s this phase where the generalizing circuit is developing and it’s there, but the logits that it’s outputting are way lower than the memorization logits. And in that phase, I basically don’t expect to see any change in test accuracy.

And then there’s the phase where the generalizing circuit logits cross the memorizing circuit for the first time. And this is maybe another place where there’s a difference between the toy model and what you actually observe. In the toy model, the thing you would expect to see is a jump from 0% accuracy as it’s just below the equality threshold to 100% accuracy the moment the generalizing logit crosses the memorizing logit, because that changes what the highest strength logit is.

But in practice, we see the accuracy going through all these intermediate phases, so it goes through 50% before getting to 100%. And the reason that’s happening is because there are some data points which are being correctly classified and some which are incorrectly classified.

And so this is pointing to a place where the theory breaks down, where on some of the points, the generalizing circuit is making more confident predictions than on other points, which is why you get these intermediate stages of accuracy, so that’s one thing.

This also suggests why the cleanup is important to get to full accuracy. If the memorizing circuit is making these high predictions on many points, even when the generalizing circuit is around the same level, because of the variance, you just need the memorizing circuit to disappear before you really get 100% prediction accuracy.

Daniel Filan: Right, so the story is something like: you start learning the generalizing circuit and you start getting the logits being somewhat influenced by the generalizing circuit, but you need the logits to start being somewhat near each other to get a hope of making a dent in the loss. And for that to happen, you more need the memorization circuits to go away. There’s the formation of the circuit, and then there’s switching the weights over, is roughly the story that I’m thinking of.

Vikrant Varma: Yeah, that’s right.

Semi-grokking and ungrokking

Daniel Filan: Gotcha. There’s something that you mentioned called semi-grokking and ungrokking. Actually, can you describe what they are?

Vikrant Varma: Sure. I’ll start with how should you think about the efficiencies of these two different circuits. If you have a generalizing circuit that is doing modular addition, then if you add more points to the training set, it doesn’t change the algorithm you need to get good training loss. And so you shouldn’t really expect any modification to the circuit as you add more training points. And therefore, the efficiency of the circuit should stay the same. Whereas if you have a memorizing circuit, then as you add more points, it needs more weights to memorize those points, and so you should expect the efficiency to be dropping as you add more points.

Daniel Filan: Yeah, or another way I would think about this is that if I’m memorizing n points - I figured out the most efficient way to memorize the n points - the (n+1)th point, I’m probably not going to get that right because I just memorized the first ones, so I’ve got to change to memorize the (n+1)th point, and I can’t change in a way that makes me more efficient because I was already the most efficient I could be on the n points. And so even just at a macro level, just forgetting about adding weights or whatever, it just has to be the case that memorization is losing efficiency the more you memorize whereas generalization, you wouldn’t expect it to have to lose efficiency.

Vikrant Varma: Yeah. Yeah, that’s a good way to explain it.

Daniel Filan: And of course I stole that from your paper. I don’t want to act like I invented that.

Vikrant Varma: No, but it’s a good explanation. So I think a direct consequence of this… so when you couple that with the fact that at very low dataset sizes, it appears that memorization is more efficient than generalization, then you can conclude that there must be a dataset size where memorization is increasing as you increase the dataset size, generalization parameter norm is staying the same… There must be a crossover point. And then you can ask the question: what happens at that crossover point, when you have a dataset size where the efficiency of generalization is equal to the efficiency of memorization?

And so we did some maths in our toy model, or our theoretic model I should say, and came up with these two cases, these two different things that could happen there. And this really depends on the relationship between… When you scale the parameters by some amount, how does that scale the logits? And if it scales the logits by more than some threshold, then it turns out that at this equality point you will just end up with a more efficient circuit period. But if the scaling factor is lower than some threshold, then you will actually end up with a mixture of both the memorizing and the generalizing circuits. And the reason for this is: because you’re not able to scale the logits as much, it’s more efficient to allocate the parameter norm between these two different circuits when you are considering the joint loss of L2 plus the data loss.

Daniel Filan: Okay, so something like… it’s sort of the difference between convex and concave optimization, right? You’re getting diminishing returns per circuit, and so you want to invest in multiple circuits rather than going all in on one circuit. Whereas in some cases, if you have increasing returns, then you just want to go all in on the best circuit.

Vikrant Varma: Yeah, that’s right. And in particular, the threshold is like… there’s quite a cool way to derive it, which is that the L2 is scaling as the square of the parameter norm. So if the logits are scaling faster than that, then you’re able to overcome the parameter penalty by just investing in the more efficient circuit. But if they’re not scaling faster than that, then you have to split. And so the threshold ends up being if you’re able to scale the logits faster than to the power of two.

Daniel Filan: Okay. So you have this semi-grokking and ungrokking, right? Where you’re training on this subset of your training dataset and you lose some test accuracy - either some of it or all of it - basically by partly or fully reverting to the memorizing solution. So this is an interesting phenomenon because… maybe you know better than me, but I’m not aware of people talking about this phenomenon or connecting it to grokking before. Or they’ve talked about the general phenomenon of catastrophic forgetting, where you train your network on a different dataset and it forgets stuff that [it] used to know. But in terms of training on a subset of the dataset, I’m not aware of people discussing that before or predicting that before. Is that right?

Vikrant Varma: Yeah, I think that’s right. So we got a lot of questions from reviewers about “how is ungrokking any different from catastrophic forgetting?”, to the extent that in the newer version of the paper, we have a whole section explaining what the difference is.

I think basically I would view it as a much more specific and precise prediction than catastrophic forgetting. So one difference is that we’re training on a subset of the data, and this is quite important because this rules out a bunch of other hypotheses that you might have about why grokking is happening.

So for example, if your hypothesis is: the reason grokking happens is because you don’t have the correct representations for the modular addition task, and once you find those representations, then you’ve grokked - that’s a little bit incompatible with then reducing the training data size and ungrokking, because you already had the representations and so you need this additional factor of a change in efficiency.

Or another example is a random walk hypothesis, where somehow you stumble upon the correct circuit by randomly walking through parameter space. And that also either doesn’t say anything about it, or anti-predicts ungrokking, because you were already at that point. So I think that’s quite an important difference.

I think going back to the difference between catastrophic forgetting [and ungrokking], I think another more precise prediction is that we’re able to predict the exact dataset size at which you see ungrokking, and it’s quite a phase-change-y phenomena or something. It’s not like as you decrease the dataset size, you’re smoothly losing test accuracy, in this case, which is more the kind of thing you might expect from traditional catastrophic forgetting.

Daniel Filan: Right. My impression was that the thing you were predicting would be that there would be some sort of phase change in terms of subset dataset size, and also that that phase change would occur at a point independent of the strength of weight decay.

Vikrant Varma: That’s right.

Daniel Filan: But I thought that you wereln’t able to predict where the phase change would occur. Or am I wrong about that?

Vikrant Varma: That’s right. Our theory is not producing a quantitative prediction of exactly what dataset fraction you should expect that phase change to happen at. That’s right.

Daniel Filan: Yep. But it does predict that it would be a phase change and it would happen at the same point for various levels of weight decay. One cool thing about this paper is it really is a nice prediction and you’ve got a bunch of nice graphs, [you] kind of nail it, so good job on that.

Vikrant Varma: Thank you.

Daniel Filan: But one thing I’m wondering about is: you have this phenomenon of ungrokking and it seems at least like an instance of catastrophic forgetting that you’re able to say more about than people have previously been able to say. But this offers an opportunity to try and retrodict phenomena, or in particular… I’m not an expert in catastrophic forgetting, but my understanding is that one of the popular approaches to it is this thing called “elastic weight consolidation”, where you basically have different learning rates per parameters, and you reduce the learning rate, so you reduce the future change in parameters for those parameters that were important for the old task. That’s one method, you might be aware of others. Does your view of grokking and ungrokking retrodict these proposed ways of dealing with catastrophic forgetting?

Vikrant Varma: I think not directly. I can see a few differences. I’m not aware of this exact paper that you’re talking about, but I think depending on the task, there might be different reasons why you’re getting forgetting. So you might be forgetting things that you memorized or you might be forgetting algorithms that are appropriate for that part of the data distribution. That’s one aspect of it.

I think a different aspect is that it’s not clear to me why you should expect these circuits to be implemented on different weights. So if the theory is that you find the weights that are important for that algorithm and then you basically prevent those weights from being updated as fast, so you’re not forgetting, then I think that is pointing at a disjoint implementation of these circuits in the network. And that’s not something that we are really saying anything directly about.

Daniel Filan: Gotcha. Yeah, I guess it makes sense that it would depend on the implementation of these circuits.

Another question I have is: in a lot of your experiments, like you mentioned, you are more interested in the steady state than the training path, except for just the initial prediction of grokking, I guess.

Vikrant Varma: To be clear, the paper deals with the steady state; I’m very interested in the training path as well.

Daniel Filan: Fair enough. So if I look at the ungrokking stuff, it seems like… So there’s this steady state prediction where there’s this critical dataset size, and once you’re below the critical dataset size, you ungrok and it doesn’t really matter what your weight decay strength was.

If I naively think about the model, it seems like your model should suggest that it should take longer for less weight decay because you have less pressure to… You care about the complexity, but you’re caring about it less, per unit of time. And similarly, that grokking should be quicker for more weight decay. I guess it’s a two-part question. Firstly, do you agree that that’s a prediction of this model? And secondly, did that bear out?

Vikrant Varma: Yeah, so I think this is a prediction of the model, assuming you’re saying that weight decay affects the speed of grokking?

Daniel Filan: Yes, and of ungrokking.

Vikrant Varma: Yeah, I think that is a prediction of the model. Well, to be fair, it is a retrodiction, because the Power et al. paper already shows that grokking takes exponentially longer as you reduce the dataset size, and I forget what the relationship is, but it definitely takes longer as you reduce the weight decay.

Daniel Filan: And does ungrokking take longer as you reduce weight decay?

Vikrant Varma: We don’t show this result in the paper, but I’m fairly sure I remember that it does, yeah.

Generalizing the results

Daniel Filan: Okay, cool. So I’d like to talk a bit about possible generalizations of your results. So as written, you’re basically talking about efficiency in parameter norm, where if you increase parameter norm, you’re able to be more confident in your predictions, but that comes at a penalty if you train with weight decay.

Now, as your paper notes, weight decay is not the only situation in which grokking occurs, and you basically hypothesize that there are other forms of regularization, regularizing against other forms of complexity and that there could be some sort of efficiency in those other forms of complexity that might become relevant.

I’m wondering, do you have thoughts on what other forms of complexity I should be thinking of?

Vikrant Varma: Yeah, so we’ve already talked about one of them, which is that circuits might be competing for parameters on which to be implemented. This is a kind of capacity constraint. And so you might think that circuits that are able to be implemented on fewer parameters, or using less capacity (however you define capacity in the network), would be more efficient. So I think some relevant work here is bottleneck activations: I think this is from “Mathematical circuits of transformers”, which is talking about other phenomena like superposition that you would get from constrained capacity.

So that’s one more notion of efficiency. I think possibly robustness to interference could be another kind of efficiency: how robust is the circuit to being randomly invaded by other circuits. Maybe also robustness to drop-out would be a similar thing here. And then I think there are things like how frequently does the circuit occur, which might be important… From a given random seed will you be able to find it?

Daniel Filan: Do you mean: what’s the probability mass of it on the initialization prior over weights?

Vikrant Varma: Yes. And also then on the kinds of parameter states that SGD is likely to find. So this is the implicit priors in SGD. There’s some work on implicit regularization of SGD and showing that it prefers similar kinds of circuits to what L2 might prefer, but probably it’s different in some interesting way.

Daniel Filan: Okay. If I think about the results in your paper, a lot of them are generic to other potential complexity measures that you could trade off confidence against. But sometimes you rely on this idea… in particular for your analysis of grokking and semi-grokking, you play with this math notion of: if I scale up some parameters in every layer of a ReLU network, that scales up the logits by this factor, and therefore you get this parameter norm coming off. And I think this is involved in the analysis of semi-grokking versus ungrokking, right?

Vikrant Varma: Yes.

Daniel Filan: So I guess the prediction here would be that maybe semi-grokking is more likely to occur for things where you’re trading off weight parametrization as compared to robustness to drop-out or stuff. Does that sound right to you?

Vikrant Varma: I think in general it’ll be very hard to observe semi-grokking in realistic settings because you need such a finely tuned balance. You need all these ingredients. You need basically two circuits, or two pretty distinct families of circuits, with no intermediate circuits that can do the task well between them. You need a way to arrange it so that the dataset size or other hyperparameters are causing these two circuits to have very, very similar levels of efficiency.

And then also you need it to be the case that, under those hyperparameters, you’re able to actually find these two families of circuits. So you’ll probably find the memorizing one, but you need to be able to find a generalizing one in time. And this just seems like quite a hard thing to happen all at once, especially the fact that in realistic tasks you’ll have multiple families of circuits that are able to do the training task to some degree.

Daniel Filan: So semi-grokking seems unlikely. I guess it does seem like the prediction would be that you would be able to observe ungrokking for the kinds of grokking that don’t depend on weight decay. Am I right that that is a prediction, and is this a thing that you’ve tested for yet? Or should some enterprising listener do that experiment?

Vikrant Varma: So to be clear, the experiment here is “find a different notion of efficiency and then look for ungrokking under that notion”?

Daniel Filan: The experiment would be “find an instance of grokking that doesn’t happen from weight decay”. Then it should be the case that: [you] train your data, get grokking, then train data on subsets of various sizes, and there should be a critical subset size where below that you ungrok and above that you retain the grokking solution when you fine-tune on that subset.

Vikrant Varma: Yeah, I think this is basically right, and our theory does predict this. I will caveat that dataset size may not be the right variable to vary here, depending on what notion of efficiency you’re using.

Daniel Filan: Well, I guess in all notions of efficiency, it sounded like there was a prediction that efficiency would go down as the dataset increased for the memorizing solution, but not for the generalizing solution, right?

Vikrant Varma: Yeah, that’s right.

Daniel Filan: As long as you believe that the process is selecting the most efficient circuit.

Vikrant Varma: Yeah, that’s right.

Daniel Filan: Which we might worry about if there’s… you mentioned SGD found different efficiency generalizing solutions, so maybe you might be worried about optimization difficulty. And in fact maybe something like parameter norm is easier to optimize against than something like drop-out robustness, which is less differentiable or something.

Vikrant Varma: Yeah, I think that’s right. I think you’re right that in this kind of regime, dataset size is pretty reasonable. I was imagining things like model-wise grokking, where on the X-axis, instead of amount of data, you’re actually varying the size of the model or the number of parameters or the capacity or whatever.

But all of these different models are actually trained on the same amount of data for the same time. And it’s also less clear how exactly you would arrange to show ungrokking there because naively, you can’t go to a higher-size model and then reduce the size of the model. But maybe there are ways to show that there.

Vikrant’s research approach

Daniel Filan: Gotcha. So if it’s all right with you, I’d like to move on to just general questions about you and your research.

Vikrant Varma: Cool.

Daniel Filan: So the first question I have is: we’ve talked about your work on grokking, we’ve talked about your work on latent knowledge in large language models. We haven’t talked about it, but the other paper I know you for is this one on goal misgeneralization. Is there a common thing that underlies all of these that explains why you worked on all of them?

Vikrant Varma: Well, one common thing between all of them is that they are all projects that are accessible as an engineer without much research experience, which was one of my main selection criteria for these projects.

So yeah, I guess my background is that I’ve mostly worked in software engineering, and then around 2019 I joined DeepMind and about a year later I was working on the alignment team. I did not have any research experience at that time, but I was very keen to figure out how you can apply engineering effort to make alignment projects go better.

And so certainly for the next two or three years, I was mainly in learning mode and trying to figure out how do people think about alignment? What sorts of projects are the other people in the alignment team interested in? Which ones of these look like they’re going to be most accelerated by just doing good engineering fast? And so that’s where a bunch of the early selection came from.

I think now I feel pretty drawn to working on maybe high risk, high reward things that might end up mattering if alignment by default (as I see the plan) doesn’t go as expected. It feels like the kind of thing that is potentially more neglected. And maybe if you think that you need a bunch of serial research time to do that now before you get very clear signals that, I don’t know, we haven’t done enough research on some particular kind of failure mode, then that feels important to do now.

Daniel Filan: Okay. So should I be thinking: lines of research where both, they’re approachable from a relatively more engineering-heavy background, and also laying the foundation for work that might come later rather than just attempting to quickly solve a problem?

Vikrant Varma: Yeah, that’s right. That’s certainly what I feel more drawn to. And so for example, I feel pretty drawn to the eliciting latent knowledge problem. I think there is both interesting empirical work to do right now in terms of figuring out how easy is it to actually extract truth-like things from models as we’ve been discussing, but also framing the problem in terms of thinking about methods that will scale to superintelligent systems[, this] feels like the kind of thing that you just need to do a bunch of work in advance. And by the time you’re hitting those sorts of problems, it’s probably quite a bad situation to be in.

Daniel Filan: Gotcha. So what should I think of as your role in these projects?

Vikrant Varma: I think it varies. I would describe it as a mix of coming up with good research ideas to try, trying to learn from people who have been around in the alignment community much longer than me, and also trying to supply engineering expertise

So for example, currently I’m working on sparse autoencoders for mechanistic interpretability, and I am very new to mechanistic interpretability. However, all of the people I work with (or many of the people I work with) have been around in mech. interp. for a long time. And it’s great for me to understand and try to get knowledge directly from the source in a way.

I think at the same time, with sparse autoencoders in particular, that’s the kind of project where… Partly what drew me to it was Chris Olah’s tweet where he said… I’m not sure exactly what he said, but it was something like “mech. interp. might be in a different mode now where if SAEs [Sparse AutoEncoders] work out, then it’s mostly an engineering problem, it’s not so much a scientific problem”. And that kind of thing feels very exciting to me, if we’re actually able to scale up to frontier models.

Daniel Filan: It could be. I do find myself thinking that there’s still a lot of science left to do on SAEs, as far as I can tell.

Vikrant Varma: Yeah, I don’t disagree with that.

Daniel Filan: Perhaps I should say for the listener, a sparse autoencoder - the idea is that you want to understand what a network is thinking about. So you train a function from an intermediate layer of the neural network to a very large vector space, way more dimensions than the underlying thing, and then back to the activation space, and you want to train this to be the identity function, but you want to train it so that the intermediate neurons of this function that you’ve learned very rarely fire. And the hope is that you’re picking up these underlying axes of variation, and hopefully only a few of them are happening at a time, and hopefully they correspond to concepts that are interpretable, and that the network uses, and that are underlying facts about the network and not just facts about the dataset that you happen to train the autoencoder on.

And all three of those conditions seem like they need more work to be established. I don’t know, I’m not super up to date on the SAE literature, so maybe somebody’s already done this, but I don’t know, that’s a tangent from me.

Vikrant Varma: I definitely agree. I think there’s a ton of scientific work to do with SAEs. It just also happens to be the case that there’s… It feels like there’s a more direct path or something to scaling up SAEs and getting some sort of mech. interp. working on frontier models that, at least in my view, was absent with previous mech. interp. techniques, where it was more…

Daniel Filan: Human intensive, I guess?

Vikrant Varma: Yeah, more human intensive and a much less direct path to doing the same kind of in-depth circuit analysis on larger models.

The DeepMind alignment team

Daniel Filan: I’d next like to ask about the alignment team at DeepMind. So obviously I guess you’ve been there for a few years.

Vikrant Varma: Yeah.

Daniel Filan: What’s it like?

Vikrant Varma: It is honestly the best place I’ve worked. I find the environment very stimulating, there’s a lot of freedom to express your opinions or propose research directions, critique and try to learn from each other. I can give you an overview of some of the projects that we’re currently working on, if that helps.

Daniel Filan: Yeah. That sounds interesting.

Vikrant Varma: So I think the team is roughly evenly split between doing dangerous capability evaluations, doing mechanistic interpretability, rater assistance, and various other emerging projects like debate or process supervision. So that’s roughly the split right now.

I think apart from that, to me it feels like an inflection point right now because safety is getting a lot of attention within DeepMind, I think. So Anca Dragan recently joined us, she is a professor at Berkeley. To me it feels like she has a lot of buy-in from leadership for actually pushing safety forward in a way that feels new and exciting. So as one example of this, we’re spinning up an alignment team in the Bay Area. Hopefully we’ll have a lot more resources to do ambitious alignment projects in the future.

Daniel Filan: Sure. Is that recruiting new people from the Bay Area or will some people be moving from London to seed that team?

Vikrant Varma: That’s going to be mostly recruiting new people.

Follow-up work

Daniel Filan: Gotcha. The second to last thing I’d like to ask is: we’ve talked about this grokking work and this work on checking this CCS proposal. Do you have thoughts on follow-up work on those projects that you’d really like to see?

Vikrant Varma: Yeah, definitely. So I think with the grokking paper, we’ve been talking about a bunch of potential follow-up work there. I think in particular, exploring other notions of efficiency seems really interesting to me. I think the theory itself still can produce quite a lot of interesting predictions. And from time to time I keep thinking of new predictions that I would like to try that that just don’t fit into my current work plans and stuff.

So an example of a prediction that we haven’t written down in the paper but that occurred to me a few days ago, is that: our theory is predicting that even at large dataset sizes where you’re seeing grokking, if the exponent with which you convert parameter norms into efficiency is small enough, then you should expect to see a non-zero amount of memorization even at large dataset sizes. So the prediction there is that there should be a train-test gap that is small but not zero. And this is in fact true. And so the thing you should be able to do with this is use the empirical estimates of memorization and generalization efficiency to predict train-test gap at any dataset size.

So that’s one example. I think this theory is pretty fruitful and doing work like this is pretty fruitful. I would love to see more of that. On the CCS side, a thing I would love to see is test beds for ELK methods. So what I mean by that is examples of networks that are doing something deceptive, or that otherwise have some latent knowledge that you know is in there but is not represented in the outputs. And then you’re really trying your hardest to get that latent knowledge out, using all sorts of methods like linear probing or black box testing, maybe anomaly detection. But I think without really good test beds, it’s hard to know. It’s easy to fool yourself about the efficacy of your proposed ELK method. And I think this is maybe quite related to the model organisms agenda as well.

Daniel Filan: Well, I think that wraps up about what we wanted to talk about. Thanks very much for being on the show.

Vikrant Varma: Yeah, thanks for having me.

Daniel Filan: This episode is edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Filming occurred at FAR Labs. Financial support for this episode was provided by the Long-Term Future Fund and Lightspeed Grants, along with patrons such as Alexey Malafeev and Tor Barstad. To read a transcript of this episode or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at [email protected].

Discuss

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt

Published on April 11, 2024 9:30 PM GMT

YouTube link

A lot of work to prevent AI existential risk takes the form of ensuring that AIs don’t want to cause harm or take over the world—or in other words, ensuring that they’re aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called “AI control”: ensuring that AI systems couldn’t take over the world, even if they were trying to.

Topics we discuss:

Daniel Filan: Hello, everybody. In this episode, I’ll be speaking with Buck Shlegeris and Ryan Greenblatt. They work at Redwood Research on the project of AI control: that is, averting risks from AIs that intentionally subvert safety measures. For links to what we’re discussing, you can check the description of this episode and you can read the transcript at axrp.net. All right, Buck and Ryan, welcome to the show.

Buck Shlegeris: Exciting to be here.

Ryan Greenblatt: Exciting to be here as well.

What is AI control?

Daniel Filan: Cool. So I guess we’re going to be talking about the control approach to AI safety (or something), that you two have written about. For people who haven’t done background reading, what is the control approach?

Buck Shlegeris: AI control is an approach to managing intentional risk posed by models. In particular, it’s an approach to managing risks that arise from models intentionally subverting your attempts to ensure and evaluate safety. The core idea of AI control is: here are two different reasons why you might be safe from models intentionally causing problems. One is that the models might not try to intentionally cause problems, and the second argument is you might have it be such that even if your models try as hard as they can to cause problems, they’re unable to do so. And AI control is the second class of approach to mitigating problems with models intentionally subverting your safety techniques. And we think that this approach has the particularly nice property that it’s a lot more methodologically simple to think about and ensure than trying to make sure your models actually just don’t at all try to subvert your safety techniques.

Daniel Filan: What makes it easier to think about?

Ryan Greenblatt: It’s easier to imagine specifically what goes wrong and to discuss specific threat models, whereas for example, it’s hard to know exactly how AIs internally represent intentions or what would make their intentions change. And so it’s very unclear if you, for example, train on better feedback: will that make it less or even more likely that your AI would try to do things that are actively against your interests? And it’s just pretty unclear what the sign on various interventions are or how strong various interventions will be for the other class of non-control things, which is often called alignment.

Daniel Filan: Okay, so should I basically be thinking that in control, we’re trying to think about objects, like can somebody break through this wall, or can somebody write sketchy code without me being able to check, whereas for alignment, I have to think about neural net generalization and how desires are encoded, and hopefully, the first class of things are just easier to think about?

Buck Shlegeris: That’s a good way of putting it. There’s basically a classic set of arguments about why AI alignment is hard to ensure, which is: no matter how your AI is behaving, you have to worry that it’s just behaving that way because it thinks it has some way of looking good now in the hopes of causing problems later. And basically, any strategy which tries to ensure safety via doing some set of interventions to make a model be aligned, especially if those only involve looking at the input/output behavior of the model, run into this kind of problem: “Well, how do you know that it’s not just looking that way on purpose in the hopes of causing problems later?” This, in fact, seems quite intractable and difficult. And it seems to us that given the current state of technology on AI safety, if you want to minimize this type of risk, it’s by far easier to just assume that that is going on and try and render yourself safe anyway.

Once we started thinking about things in these terms, we quickly realized that a lot of the work that happens in AI safety is actually very well thought of in the control frame anyway. That is, a lot of techniques that people apply implicitly can be thought of as strategies to ensure that even if your models are scheming against you, you end up with okay outcomes. But we think that, inasmuch as that’s in fact what you’re trying to do with your safety techniques, you’re a lot more likely to analyze them correctly if you think about them explicitly in this adversarial frame, where you’re asking about whether there’s any strategy that your models are able to employ that would allow them to cause problems.

Daniel Filan: I actually want to ask another question about that. One thing that’s similar to control, or at least the kinds of thing I see you doing, is alignment via adversarial training. So you put your AI in a situation where maybe it can do something sort of like taking over the world or whatever, and if it actually does it, then you give it reward negative 10 or something and try to gradient it away. How do you think of the control approach broadly as comparing to that?

Ryan Greenblatt: A technique like adversarial training could be used either from a control perspective or could be used to ensure an alignment safety condition. These are just different post-conditions, or different conditions you’re aiming to ensure in order to ensure safety. So for ensuring control or for trying to achieve control, the hope would be that you adversarially evaluate your AI, and then if you catch evidence of extreme egregious misalignment, then you would reconsider and decide what to do next. And in fact, it seems pretty scary to have an AI that you know was egregiously misaligned in the past and you’ve trained it no longer to do that in adversarial training. [For] such an AI, it’s not clear that you’ve necessarily made it aligned versus just made it better at hiding or better at avoiding adversarial training.

For both the hope of ensuring that your AI ends up aligned or ensuring that your AI has no way of causing problems, you acquire the same property, which is that there is no strategy the AI could follow to cause problems in deployment, or in actual usage of the AI, that you wouldn’t have caught in adversarial training. And this seems like naively a pretty hard property to ensure for a variety of reasons. And because of that, the actual conditions in which adversarial training—or if you don’t train against it, adversarial evaluation—seem useful… I mean it’s going to depend a lot on the details, but it’s definitely not an obvious sufficient solution.

Daniel Filan: I also would like to ask you about the contrast… I feel like 10 years ago, a bunch of AI safety people talked a lot about AI boxing: putting AIs in a situation where they couldn’t really do that much. How do you think control is conceptually similar or distinct to that?

Ryan Greenblatt: I think in some sense, you could just rename “control” to “boxing”, but we think that name is much worse because boxing implies you’re doing just mechanisms that restrict the AI’s affordances, whereas I think a really important class of interventions is observing the AI over a wide range of situations and using that to rule out certain problematic things. So for example, if you oversee or have humans observe 1% of the AI’s actions, and that looks fine in those 1% of cases, and you just do this on an ongoing basis, then that’s sufficient to ensure that the AI can only do things that a human would catch quite rarely. And it’s not clear this is… I think when people think of boxing, this is not naturally what people will think of.

Another key aspect of control that I think is not very naturally implied by the word “boxing” is that using weaker AIs (that we know aren’t smart enough to be seriously misaligned) to do monitoring, or to do other tasks, might be really key. More generally, I think when people say boxing, they are implicitly assuming a very powerful system that is extremely, extremely limited, whereas we’re often imagining systems that are much more comparable in capabilities in various ways to humans, but which are actually given a pretty broad set of affordances, which seems like it will be necessary to get your AIs to do a lot of work, though you’ll still want to, of course, limit things so that they can’t cause problems.

Daniel Filan: Maybe now’s a good time for you guys to say a bit more concretely: what’s an example or an instance of something that fits into the control approach?

Buck Shlegeris: Here’s an example. Maybe I’m just going to say a rough sense of what I would like AI labs to commit to, or a simple example of some commitments that AI labs could follow, which I would think of as an example of them trying to minimize risk from models intentionally subverting safety mechanisms using control. So here’s basically the strategy that the AI lab would follow: they would say, “Here is a capability evaluation that we’re going to use to decide whether models are capable of scheming against us.” So for example, right now, it seems to me like GPT-4 is incredibly unlikely to be scheming against us. And this is basically just because I know GPT-4. There are only 10 people in the world I talk to more than I talk to GPT-4, and there’s just no way that it’s playing the long game. There’s just no way that its reasoning process behind behaving so well as a loyal assistant to me is just being like, “Well, how am I eventually going to grab the universe?”

Ryan Greenblatt: Why do you think that?

Buck Shlegeris: Because among other things, when you ask it questions about what it would do if it wanted to grab power, its answers are really [redacted]. When you ask it questions about what it would do in order to grab power, its answers are not very smart. It seems to be pretty poor at reasoning through a lot of these things, and its quality of answers to those questions is comparable to the quality of its answers to a lot of other questions that are less obviously AI-scheming related. I wish that I had a better answer to this question and I think this is an important topic for future research.

Daniel Filan: Let’s just say we take it on faith that GPT-4 is not trying to scheme to grab power or anything.

Buck Shlegeris: So here’s what we want to do. We want to have a set of capability evaluations we can apply to all models as they’re produced that tell us whether we are willing to bet the future on that model not scheming against us. Now, let’s use the word “trusted” to describe models that are too incompetent for us to be worried about the possibility that they’re scheming against us. Now, the control ask is as follows: the AI lab ensures that it is never the case that there would be a high probability of a bad outcome conditional on all models that are capable of scheming against them, in fact scheming against them with the best scheme that they could come up with.

So this would require them to… For example, if they were hoping to deploy future AIs as autonomous researchers inside their lab, which some people seem to think is a basically default plan. So basically, imagine you’ve got ChatGPT with Code Interpreter, except it’s a recursive ChatGPT that can ask questions to its own ChatGPT browser window, and it’s running on some dev machine and it’s doing research problems and talking to researchers once every hour about its progress on the things that it’s doing. If the models there were given the same level of permissions as employees of large labs are currently given, then if that model was trying to exfiltrate its weights, it would probably succeed. And the control obligation is that the lab is obligated to ensure that it’s extremely unlikely that the models would be able to exfiltrate their weights even if all the models that were capable of scheming were trying as best as they could.

The reason why I centered on “exfiltrate their weights” there is because I think that in computer security, people talk about security properties, which are things that you try to ensure at all times. So it’s obviously not literally a catastrophe for AIs to exfiltrate their weights, but when you want to argue that an AI is not going to cause catastrophe, you should probably have “the AI weights never leave my server” as a security property that you ensure and you say, “Well, I’m going to be safe because nothing bad can happen if the AI’s weights are still controlled on my server and also they’re not going to get out into the outside world.”

Daniel Filan: Do you think that lack of weight exfiltration is a sufficient safety property?

Buck Shlegeris: I don’t think that preventing model weight exfiltration is a sufficient safety property. One example of a threat model that is not model weight exfiltration, that seems like it would be very bad, is if the model totally hacked the data center in which it’s running. “Hacking the SSH Server” is the title of the Alignment Forum post I wrote about this last year. I think that if you can ensure that the AIs are under your control, then it will probably be relatively easy—for models that are not wildly superhuman—to argue that they won’t be able to pull off anything egregiously bad.

Ryan Greenblatt: The other core threat models you have to worry about [are]: if you’re using those AIs for a lot of work, you need to argue that that work won’t be sabotaged in a way that’s catastrophic. And if that work is safety critical and sabotage could cause huge problems, then you have to be concerned about that. And I think Buck is implicitly claiming that that side of the equation maybe seems easier, though I’m not so sure that’s true in practice.

Daniel Filan: So by sabotage, should I imagine something like: the AI is doing AI alignment research, but it’s secretly doing bad AI alignment research that will produce a sketchy AI?

Ryan Greenblatt: Yeah. So you can imagine that it does AI alignment research that means that when you apply that research, you’ll get a future AI that’s aligned to your current AI rather than aligned to humanity. You could imagine that that research is just bad and does nothing of value, maybe even…

Daniel Filan: Gotcha. Okay, so we’re imagining labs have some set of security properties that they think is enough, and they are going to try to prove that they can just maintain these security properties if the AIs are scheming against them and trying to subvert them with all their capabilities?

Ryan Greenblatt: That’s right, yes.

Buck Shlegeris: And in particular, the argument is probably going to look like the lab saying, “It’s very unlikely that the AI would succeed at violating one of these security properties before we noticed particularly strong evidence that it was trying to do so.”

Daniel Filan: Okay, so there’s going to be some protocol for trying to notice if an AI is going to do something sketchy, and then there’s another protocol for what to do once you catch an AI doing something sketchy?

Buck Shlegeris: Yep, something like that.

Daniel Filan: Okay. Maybe it’s not quite the focus of the whole thing, but maybe we can say a little bit about what the protocol is for “you catch an AI doing something sketchy”. One decent baseline is “stop using that AI”. Does that sound right to you?

Ryan Greenblatt: Yeah. I think that the baseline is just: stop and reconsider all your life choices, report to the world what’s going on, because there are probably other people using their AIs, which are in a roughly comparable state, and knowing that some AI was plotting against us seems pretty worthwhile to know. And then my hope there would be [that] the baseline is: the world is informed, the world does something sensible that looks like really figuring out what’s going on and then making a careful decision about when to resume building powerful AIs. I think that that might be overly optimistic—I don’t feel strongly either way—but if that fails, then we think that catching an AI like this gives you a number of better options for continuing, which you didn’t have before catching the AI, even if you have to continue deploying such an AI.

So for example, you can study how well your countermeasures worked in that exact situation and see how close the AI was to succeeding. You can analyze what the AI’s intentions were and what its plans are for escape, at least in that one instance, and get some sense of what’s going on. You can potentially even model what it would do upon having escaped, because you can play out the scenario as though it went well and did not get caught, and then pretend as though it’s running virtually in the outside world when it’s actually running in a sandbox of some sort and see what it would do. You can study the internals of the model while it’s doing this. And insofar as you had any sort of machinery for manipulating that, you could apply it to this instance, which is an example of a case you really haven’t seen before of an AI doing something that’s very egregiously bad, very intentionally, based on a long thought-out plan, or a thing that’s been lying in wait for a long time.

Protocols for AI control

Daniel Filan: There’s a lot to talk about there, but for the moment, I’d like to bring us a bit back to protocols for maintaining control. My understanding is: your proposal is to basically come up with some sort of control protocol and evaluate it with a red team that’s trying to do their best to mimic an evil AI. And if the red team can’t succeed, then you think things are going well. Is that roughly right?

Ryan Greenblatt: Yeah, that’s basically right. So I should note that the red team’s goal… they should think of themselves less as trying to imitate an evil AI and more trying to imitate what are the worst things that could happen or what are the best strategies by an evil AI. So I’d prefer the red team to be less in the mindset of like, “What are the sort of plans an evil AI might think of,” and more in the mindset of like, “What are the best plans that an AI could follow,” in the same way that security researchers or cryptography researchers aren’t trying to game out the psychological biases of potential attackers and are instead just like, “What would actually be the best strategies?”

Daniel Filan: Okay. Just to give people some context, can you illustrate the kind of protocol you’re imagining?

Ryan Greenblatt: A simple baseline protocol we often think about is you-

Buck Shlegeris: To interrupt, can you say more about, before you say safety measures, what you’re imagining the AI doing and what you imagine its affordances being? Probably you’re imagining a Code Interpreter-like thing.

Ryan Greenblatt: Yeah. So we often imagine that at the point where these sorts of interventions become really critical, and AIs might be very useful, AIs are doing massive amounts of autonomous research, or at least that seems like the default. So we have AIs autonomously running code, seeing the outputs from that code and doing way more labor than humans within the AI lab are doing at this point.

Buck Shlegeris: And just to spell out, in terms of software architecture, how this would work: you have some AIs that are running on an inference server, and they’re writing code and doing research on some dev machines, and then there’s a third set of machines that are orchestrating this, that are running a scaffold, that repeatedly, ChatGPT Code Interpreter-style, ask the model for a next action by sending a request to the inference server and then getting a response back. And then: suppose that that response was, “I now want to run the following code,” then they send that code to the dev machine and run that code there and they get some output and that comes back, and then the scaffold maybe decides the next thing to do is run some more code on the inference server.

And I think the fact that we’re imagining a… I think this is a very realistic way for things to be set up. I’m not describing something which is particularly paranoid: this is probably how Code Interpreter works right now. But I think that picturing it in terms of different machines, where information’s being sent back and forth, will make some of the security measures we’re describing more obvious. So what were you saying about the security measures?

Ryan Greenblatt: And to be clear, the inference machine only runs the model weights and just takes in queries and returns responses, whereas we have a dev machine that actually runs code, and then we have some orchestrator machine that’s just communicating back and forth with both of those.

One baseline measure we often imagine is that you have some weaker model that is “trusted”, in the terminology that Buck was using before, which means that it’s not capable enough, or it’s sufficiently incapable, that it’s implausible at scheming. You can imagine GPT-4 for this trusted model. We take that trusted model and we monitor what our smarter AI is doing, and then we just basically ask it or train it to report on, “Is the thing it’s doing sketchy and problematic?”

Buck Shlegeris: So in particular, for any action, for any code that our untrusted model was proposing taking, before we send that to the dev machine to actually execute it, we first send a query to an inference server running our trusted model saying, “Hey, was this code sketchy?” And so notably, we ask this question before the code is ever actually executed and before it has any effect on the dev machine.

Daniel Filan: Okay, so your trusted model basically checks in on whether your untrusted model is doing anything sketchy. And what if it is?

Ryan Greenblatt: If it is, then the hope would be that we would report that to a human. A human would look in somewhat more detail. And if the human thinks that things are problematic, they would escalate further, which could involve… Well, first of all would involve not running that command, but also could involve pausing other AIs, and then could escalate further into shutting down AIs and analyzing the situation in further detail until we conclude that the AI is doing egregiously bad actions intentionally. And a key thing here is that human labor is scarce. So we cannot report all actions to humans, that wouldn’t work, but we can report some fraction of actions, because we’ll have some ability to have humans who understand what’s going on to some extent, or at least to understand programming, look at the actions and report that. And I think for an intuition pump, you should be thinking more like 0.1% of actions are reported, [rather] than 0.0001% or 10%.

Buck Shlegeris: I often imagine the workflow of: you’re in chat.internal.openai.com or whatever as an OpenAI researcher, and each of the little tabs or whatever is a chat with some autonomous research assistant or whatever. And every so often, one pings up with a little red exclamation mark, which is like, “One of the commands that was proposed to be run by this autonomous research assistant got flagged for looking a little sketchy. You need to look this over right now to make sure that it’s okay to run.” That scaffold is just going to be pausing. That instance is not going to be doing work on its little research project until you go in and approve this action and say, “Oh no, this was actually totally fine.”

Daniel Filan: Okay. So broadly, we’re using some mix of humans and trusted but a little bit dumb AIs to try and catch really smart AIs trying to do nasty stuff?

Buck Shlegeris: Yeah. Where by “really smart”, it might be… This gets into futurism question over which problem we’re trying to solve, which situations we’re trying to ensure are safe. But it seems to us that a lot of x-risk is posed by models that are not wildly superhuman, or perhaps well-described as roughly human-level in capabilities. And so a lot of the time, what we’re trying to talk about is: how do you make it be safe to have a lab where you have some AIs that are roughly human-level in their abilities, but are doing a huge quantity of labor, such that in absence of special countermeasures, they’d outnumber you and so be able to cause huge problems for you?

Which AIs are controllable?

Daniel Filan: Yeah. I actually want to go a little bit more [into] this. So the kinds of AIs that we’re expecting control to apply to, they’re roughly human level: can you say a little bit more about what that means?

Ryan Greenblatt: A key guarantee, or a key property we’ll need to ensure that control can work, is [that] the AIs can’t do actions which humans think look fine, but which have completely unrelated consequences that are extremely difficult to predict. So for example, the AIs can’t write code which looks like a totally normal piece of code, but in practice, does some completely insane vulnerability that a human would never have conceived of. And to be clear here, when I say, “The human wouldn’t notice,” I mean after reviewing it in a lot of detail for a long period. The claim is: it’s (at least in principle) possible for AIs to be wildly superhuman, in the sense that they can do things that we never would’ve seen coming, and control will be very unlikely to work in that case, at least with current technology. But we think it could work in cases where the AIs are qualitatively similar to humans in capabilities, or perhaps somewhat smarter in some domains, somewhat weaker in other domains, but they can’t do anything that’s completely incomprehensible.

Daniel Filan: Okay. So we’ve got these AIs that can’t do things which have effects that we wouldn’t be able to figure out even if we spent a bunch of effort trying to look into it, but they are kind of smart. They are sort of like humans, but maybe they’re a bit smarter, and also there are a ton of them and you can run them for 1 cent per hour or something. Is that what I should be thinking?

Ryan Greenblatt: Yeah, I think you should be thinking cheap. It’s somewhat unclear if they’ll be qualitatively smarter or dumber than humans, because they could be really useful even if they’re somewhat dumber than humans, because they’re so cheap and so fast. But they might be smarter than humans in some domains, for example, because we don’t expect the capability profile to be the exact same.

Daniel Filan: Okay. So in order for this approach to be interesting, it seems like the AIs, on the one hand, they can’t be that smart that they can just outsmart humans, but they also have to be capable enough to (a) be able to do really valuable research, and (b) they’ve got to pose some sort of risk that we might otherwise be worried about. I guess I’m wondering how I should think of how risky these sorts of things are and how much work we could plausibly get out of them?

Buck Shlegeris: The way that we talk about this is arguing that: the early models that are transformative and dangerous by default… To say what I mean by those two: by “transformative”, I mean it’s going to be a huge deal if you’re able to harness their labor, and by “dangerous by default”, I mean if you don’t do anything to prevent them causing huge amounts of problems, they would have the ability to do so. I want to claim that systems with those two properties are controllable, by which I mean it’s tractable to use control to drastically reduce the risks associated with using them.

And as you basically alluded to, the reasons why we care about these two properties of transformative and dangerous by default is, well, if they’re not dangerous by default, obviously, there’s no safety problem here. If they’re not transformatively useful but they are dangerous by default, then obviously, you should just not touch them: there’s just no upside that’s possibly commensurate and all of our efforts should just go into telling people to not turn on these doom machines with no upside.

So yes, our claim is in fact that systems with those two properties are reasonably likely to be controllable. I think the simplest intuition pump for this is that if the systems were like humans, then I think this property would roughly hold. And that itself is not a valid argument at all. But when you think through the possible defeaters in, say, a more specific version of the argument, I think the core of it roughly holds up. The basic pitch is: suppose we had access to a huge amount of potentially egregiously misaligned human labor. I think it would be transformatively useful to have human labor that was very cheap.

Ryan Greenblatt: And fast-running.

Buck Shlegeris: And fast-running, potentially. I think it would be very dangerous by default. Many takeovers of things have happened in human history, and in all of those cases, the perpetrators were humans, indicating an existence proof for humans causing problems. But also, when we think through the control situation, whether humans seem controllable in these kinds of cases, the answer really looks like yes, from our perspective.

Daniel Filan: Why does the answer look like yes?

Buck Shlegeris: Well, basically, it seems really hard for… When we think through the kinds of tasks that you might really want your AIs to be doing, when we think through how hard it would be for a bunch of humans to, while being put to work on those tasks, cause catastrophes to occur, especially the escaping or hacking the data center ones that we were talking about earlier, it feels to us that the kinds of countermeasures that seem employable probably make the situation very hard for the AIs.

Daniel Filan: Okay. Suppose I am a member of technical staff at OpenAI, but I’m evil: what do you do to stop me from doing sketchy, fake AI alignment work that just gives me all of the money or something?

Ryan Greenblatt: Maybe there’s a few different failure modes to distinguish between. One of them is if we imagine this sort of misaligned human running on our machines, we should think [of] it like it’s got some weights somewhere, let’s just say it’s an emulated mind or something, and if it copied its weights out into the internet, that might be really problematic for us. Or if it was able to run those weights in an untracked way inside of our data center, as Buck brought up earlier, that would also be a problem. And there’s another thing which is: it sabotages our work, or otherwise causes the work that we’re using it to do to be very problematic.

So as far as why it might be possible to prevent such a system from running or from exfiltrating its own weights: well, we’re going to need to give it the ability to run code and do a lot of open-ended tasks, but we also get to watch everything it runs ever and all the inputs it ever sees. And if we see something suspicious, we can investigate further. We can do things like reset the memory of these systems, fork them, and potentially give them precise instructions on how they’re supposed to interact with the world. So for example, we can have them delegate subtasks to other systems, we can have them have difficulties communicating with other systems that would be necessary for their overall plan. An example we often think about is: we’ll have one AI that’s writing the code and another AI that gets to use the code. And then a threat model you can be worried about is: there will be vulnerabilities in the code that will then be exploited. But if you have different AIs using the code versus writing the code, then it might be more difficult to ensure that the vulnerabilities that are inserted get exploited.

Buck Shlegeris: Even if both of those AIs are untrusted models.

Ryan Greenblatt: Yeah.

Preventing dangerous coded AI communication

Daniel Filan: Okay. So I guess if I’m imagining what do I do in this situation, it seems like it really helps if I’m able to just keep track in my head of a bunch of stuff. For instance, if I have a little coded language that I can use to myself that other people can’t figure out what it means, but lets me leave messages to myself—or people who are basically clones of me, because I’m imagining most of the AIs that are being deployed are just forks of a base model with a little bit of fine tuning, maybe. If I’m able to keep state in my head, figure out a better sense of what’s going on, talk to people who are basically like me without other people being able to figure it out, in that situation it seems like I might be able to get a lot done. Does that sound right to you? And if so, how do you stop that from happening?

Ryan Greenblatt: I think if the AIs are allowed to be spending a reasonable fraction of their time building up state and figuring out how to cause problems, then that’s going to be an issue, because we have a huge number of AIs running a massive number of copies, and that means that in aggregate the total amount of work they can be doing to cause problems could be really big. Our basic hope for avoiding this will be to restrict their communication in such a way that it’s difficult to communicate encoded information. So an example would be when they send messages to each other, the messages have to be in natural language and we’ll paraphrase them with a weaker model, and also have humans review some subset of the messages and verify that the messages don’t seem to have coded messages, as far as we can tell.

And you can potentially run experiments like: a human looks at a message, they rewrite it in some way that they think should be semantically equivalent, and then they see how that message affects the behavior of the AI that [it] would get sent to and see if that’s suspicious. And you can do all kinds of experiments along these lines that seem like it intuitively makes it very difficult for AIs to get away with this sort of encoded information passing without getting caught.

Daniel Filan: I wonder if that would nerf or severely damage the AI’s ability to be useful. If I just think about myself, if [you] ask me, “Hey, Daniel, can you do a bunch of research and coding? Also, you’re not going to be able to think for more than five minutes straight without interruption, and whenever you take notes to yourself, we’ll take those notes and we’ll decide what we think they mean and paraphrase them, and future you will get our paraphrase of your notes.” I think I would be way less productive. So am I going to be able to do anything?

Buck Shlegeris: Yeah, so the way I think about this is: almost all of the things that the AI is doing couldn’t possibly lead to a bad outcome, basically. So for example, suppose the AI is doing math research. It just thinks about math, trying to prove some theorems in text, and then it wants to write some arXiv paper and send that around to some AI buddies. I’m okay with that AI doing all its thinking without very much interference if it’s doing the equivalent of a couple of months of intellectual labor. I might want to do some anti-steganography techniques of rewriting that paper at the end in order to make sure that it wasn’t actually doing a bunch of coup science that it’s secretly encoded in this arXiv paper.

But there’s basically no problem with that model thinking in whatever way it wanted to in the course of actually doing that work. The risks are mostly, it seems to me, that the actions that the AIs are taking that lead to extreme risks are somewhat concentrated in a couple of places. And it seems not crazy to imagine that the places where you have to insert information bottlenecks are not overall that bad for the total productivity of the AIs, especially given that you can train them to get around some of these problems.

Daniel Filan: Am I right that that statement is partly relying on this assumption that the AI is not so much crazily smarter than you that it can think of ways to hack the universe just by writing a math paper?

Buck Shlegeris: Definitely.

Daniel Filan: If it’s hacking the world, it’s because it typed a command into a shell or something.

Buck Shlegeris: Yes. My best guess is that there’s a lot of risk. The margin on which we are trying to intervene is a bunch of people make a bunch of autonomous-ish AI researchers that run in their data center and have the basic same rights as employees to just take actions. We are trying to make sure that a lot of the risks that would’ve occurred in that situation don’t happen. It seems to us that many of the risks are well-described as things that basically look like that, where you didn’t treat your AIs as adversaries. And it seems like mitigating those reduces a substantial share of risk.

Daniel Filan: I have two questions. The first one I want to ask is: even in the case of “AI thinks for a while, writes a math paper, distributes it to a bunch of people”, if I imagine that the AI is significantly smarter than me… Let’s say I’m a human employee at OpenAI at this time. If there’s this math paper, my job is to rewrite it so that it’s not encoding any hidden information… I don’t have that much faith in my ability to rewrite complicated papers and preserve all the interesting content.

Buck Shlegeris: I have a lot of faith in your ability to take a math… If it’s a math paper, it seems reasonably chill.

Ryan Greenblatt: You can also experiment.

Buck Shlegeris: Yeah, this is an extremely resolvable empirical question. We can have a red team that tries to take math papers and steganographically encode information in them and then measures what the bit rate is per paper given that some paraphrasing procedure is applied. We, in fact, have a paper where we did that, using the exclusive “we”, as always at Redwood Research. Maybe Ryan, you want to say more about this?

Ryan Greenblatt: We had models respond to chat queries and used various procedures to encode messages into that. And then we paraphrased them and saw how much this degraded quality and how much it reduced the bit rate. And we basically found that with current steganography methods that models could implement, even given the ability of those models to run code and change their own sampling procedures, paraphrasing works, I would say, extremely well. Though it’s going to depend on the domain, and it’s going to depend on the gap between the model you have paraphrasing versus the model that produced the output. For example, in this math paper case, maybe current models would be too dumb to do this paraphrasing, but humans could do it, for example. Though I also think it’s plausible that you could remove a substantial amount of information with minimal loss in quality with usage of GPT-4 that’s carefully trained to do this task.

Daniel Filan: Should I think that part of the approach is just checking if you can do steganography and if it ruins things?

Ryan Greenblatt: Yeah, so indeed for our control evaluation, or evaluation whether or not we can ensure control, one of the key questions is going to be “can AIs pass bits around in ways which we wouldn’t like them to do?” And if so, that will force the threat model to be considerably bigger. And it seems like in cases where there’s basically no restrictions on AI communication, we won’t really have much hope. And so there’ll have to be at least some restriction, though it is somewhat unclear exactly how restricted things need to be.

Daniel Filan: So I still have this intuition that if we tried to do steganography on the notes of human researchers, I have this impression that that would not go well.

I mean anti-steganography measures. So any time an employee at OpenAI writes a memo, someone who doesn’t quite understand the topic as well has to reword that memo. I think that would seriously damage productivity. Does that sound right?

Ryan Greenblatt: I think I would disagree. I think it doesn’t seem like that would be the case.

Daniel Filan: Interesting.

Ryan Greenblatt: It might depend on the amount of time spent doing the rewriting.

Daniel Filan: Okay. Has anyone checked this with humans? Do we know?

Buck Shlegeris: I’m not aware of any human experiments on this particular topic.

Ryan Greenblatt: Yeah.

Buck Shlegeris: I mean, one obvious experiment which is kind of like this is: if you have GPT-4 paraphrase GPT-4 answers to questions, where it didn’t get to see the question, it just gets to see the answer or something—what’s the degradation in quality of language model assistant performance according to a preference model? It seems like an easy experiment to run.

Ryan Greenblatt: I think we’ve run something pretty similar and I think it’s not too high. I think it might be importantly different than the memo case for a bunch of reasons, but…

Daniel Filan: Yeah, by memos, I’m just thinking most internal communication. So like Slack messages.

Buck Shlegeris: I would assume almost all Slack messages are totally fine to paraphrase.

Ryan Greenblatt: Yeah. I think it’s going to depend on the content to some extent and it’s also going to depend on the amount of comprehension. So one implicit claim is this stuff improves the more legible things are. So for example, if it would be the case that you would need a bunch of context in order to know what things are important semantic details versus non-semantic details, then you’ll have to bundle up that context along with it. And there’s ways that can break, of course, but my basic guess would be that for the types of large-scale outputs that are currently going on in the ML field, it seems relatively doable. And also for internal docs that get circulated, it also seems relatively doable.

Daniel Filan: Okay. I guess there’s no easy way to resolve this right now. I’m wondering what you think about the claim that… I have some intuition that if paraphrasing communications would seriously reduce performance for human employees of OpenAI, I think that would give us strong reason to think that it would also reduce performance for language models that are smarter than current GPT-4. Do you agree with the implication, if not with the pre-condition?

Buck Shlegeris: Definitely seems like evidence.

Ryan Greenblatt: Yeah, I think it seems like quite good evidence. There’s a few reasons why the future AI case might be different.

Buck Shlegeris: For example, we can train the models to be resilient to it.

Ryan Greenblatt: Yeah, and we can train the two things in parallel and adjust our schemes.

Unpredictably uncontrollable AI

Daniel Filan: All right, fair enough. I guess the next thing I want to ask is: part of what I see as your guys’ argument is that: AIs that we use in these settings, they’re not going to be crazy good at skills that are going to be pretty relevant for hacking the data center. And I think your argument is: we’re not going to train them to be just crazy good at everything, because they won’t have to be crazy good at everything to be useful. And if we don’t train them to be crazy good at everything, they’re probably not going to just happen to be crazy good at these specific subtasks. Am I representing you right?

Ryan Greenblatt: Yeah, that seems about right.

Daniel Filan: Okay, so in terms of the second thing, I have this impression that a thing we’ve learned from training language models is sometimes you train them and then they suddenly get quite good on some random subtasks, like writing limericks or something. And if I think about that, then it makes me think that maybe we could get language models that just happen to gain a skill to a really high degree of ability, just in the course of doing normal-seeming language modeling. I’m wondering: how concerned do you think I should be about this causing AIs to be crazy good at skills that are relevant for breaking out?

Ryan Greenblatt: It’s definitely one of the core concerns, or one of the top five concerns with this overall approach. Currently, it doesn’t seem like the evidence that this is going to be a serious problem seems that compelling, where the main thing is that… AIs do get sharply better at things, but I haven’t seen that many cases where it’s like: wow, this AI acquired this ability that it can utilize to do important tasks in the world, which emerge suddenly from next token prediction and which is totally different from human ability profile, or feels like it’s way out of line from what you would expect to be the relative human ability profile in other ways. So for example, it doesn’t seem like models have very qualitatively different abilities or massively qualitatively different abilities in code writing, at least not with considerable training on doing code, though they are somewhat different in a variety of ways and somewhat better at shorter horizon things and this sort of thing.

Buck Shlegeris: I think another relevant point here is: as you say, the core question here is how likely is it that our AIs will randomly have one of these superhuman abilities that would be extremely inconvenient for us? It seems to me that the set of superhuman abilities that would be extremely inconvenient for us is kind of small and kind of extreme. For example, one of the classic concerns is: what if your models are able to zero-shot generate code that is a zero-day Linux exploit? That is actually quite hard as a task to do. It seems to me that the size of a capabilities jump required for AIs to be able to zero-shot output Linux zero-days—also, zero-shot output Linux zero-days that are not visibly weird to a trusted model which is reading the code—that’s quite an extreme level of skill. So it seems like the size of the capability jump is quite large, and that’s part of why it seems relatively unlikely to me that that’s going to occur.

Daniel Filan: Okay, so there’s some instinct that: out of all the things that you could randomly jump in skill at, there’s a small space of things where the random jump in skill is going to be useful. And therefore, to think it would happen, we’d have to have some reason to think that the jumps in skills were biased towards things that would help you break out.

Ryan Greenblatt: Roughly. It’s also worth noting that a bunch of the things that seem problematic, seem like the most straightforward route to them would just require being generally quite smart. So for example, think of the Linux zero-day thing. It seems like you could get way better at that and not as good at other domains in principle, but it feels a little bit odd to imagine. It seems the most straightforward way that the AI would get good at that is not via having amazing heuristics that are zero-day specific, because there are just not going to be massive amounts of zero-day training data in the corpus, but would be instead via building general purpose understanding of code and the world that then lets it do that. And if it was massively superhuman at that general purpose understanding, then it would’ve already at some earlier point been very useful.

Daniel Filan: Sure. Maybe one version of this concern could be: let’s say our future AIs are trained roughly the way language models are trained, which is you’ve got a bunch of stuff humans did and they’ve got to predict next token, next word, next frame of a video or something like that. It seems like for that task you could imagine that a core skill is getting really good at modeling humans, or you might imagine that one of the things you get when you get really good at modeling humans is the ability to manipulate humans. And so I wonder… it feels plausible to me that that could be a case, where you happen to suddenly jump up on the human manipulation subtask. Maybe at some point that happens disproportionately to overall improvements on everything else. And human manipulation skills seem like they could be super useful for breaking out. So I’m wondering what you think about that version of the concern.

Ryan Greenblatt: Yeah, so definitely human manipulation would be super useful. I think I have a few different responses. So one thing that’s worth noting is that I am actually not that sold that next token prediction is particularly biased towards human manipulation relative to how humans evolved and how they learn in society. It seems like understanding other humans was an extremely important part of the evolutionary environment, and also seems important in many people’s lives, it’s one of the main skills people train.

Buck Shlegeris: For example, compared to computer programming.

Ryan Greenblatt: Yeah. And therefore it feels like intuitively there’s not a strong a priori reason to think next token prediction would be strongly in that direction.

Daniel Filan: You mean strongly in that direction relative to how humans …

Ryan Greenblatt: Relative to the human capability profile. I think we could pretty likely handle among the best human manipulators from a control perspective, though that would pose difficulties. But yeah, we could dive into that. And then the other thing is that I think manipulation is reasonably likely to be not the sort of thing on which you would see one sudden jump, but which is a bunch of different skills that would combine together (though there’s not a super strong argument either way). When I think about the things on which there are sudden jumps, it doesn’t feel like this is the sort of thing, though it might be the sort of thing that goes up faster than other domains.

Daniel Filan: I mean, it would have to go up in a way that you didn’t notice, but…

Buck Shlegeris: So that’d also be weird. Okay, so suppose you wanted to know how good your model is at persuasion at any given point. Seems to me like the baseline strategy here is going to be: you make a classification dataset. You make a dataset, a classification task which you think your model is very unlikely to be extremely good at persuasion without being able to fit. So for example, a classic story via which models might be really good at persuasion is if they’re really good at guessing things about people’s state from reading transcripts of them saying stuff. And so something we can do is construct such a dataset, and as we’re training our powerful models, fine-tune them on this classification dataset and then feel sketched out if they’re starting to perform really, really well on this classification task.

Daniel Filan: Okay, and am I right that you think the point at which we would feel really sketched out is in the future?

Buck Shlegeris: Yes.

Daniel Filan: Okay. I guess that doesn’t seem as obvious to me. I don’t know, you hear rumors of AIs being able to figure out who wrote various comments on the internet, which is kind of impressive. I don’t know. Can you give me some intuition for why this is in the future and hasn’t already happened?

Buck Shlegeris: They just don’t seem that persuasive. They don’t seem that good at understanding people.

Daniel Filan: Okay.

Buck Shlegeris: I don’t know. My lived experience with my friend, ChatGPT-4, does not render me amazingly impressed.

Ryan Greenblatt: Yeah. Separately, I think that just knowing state about a person isn’t sufficient, though it’s pointing in a dangerous direction. In particular, the standard concern would be… when we have human employees, human employees could persuade each other. And those human employees actually just already do have a bunch of state on each other. So for example, co-workers at OpenAI obviously know a bunch of facts about each other, know a bunch of facts about their personal life, and we’re not immediately wrecked by them doing manipulation in this context, though it definitely does cause problems. And similarly, the AI would need to infer facts beyond that of what humans can infer, potentially in cases where they’re somewhat blinded. So right now it seems like AIs can infer identity superhumanly well. And then in principle, this could get to a point where from relatively anonymized snippets, AIs could infer more information than a co-worker would be able to infer. But that’s actually pretty far and would require quite powerful abilities beyond the current setup.

Daniel Filan: Okay. I’m trying to think about classification tasks that you think would be useful for figuring out when AIs are so good at human manipulation that we should be worried. So I guess one of them is guessing human internal state, guessing who wrote a thing, where you want to see if you can outperform humans at that?

Buck Shlegeris: I’ve seen “which of two proposed messages is more likely to cause a particular outcome?”

Daniel Filan: Okay. And basically, we’re going to have a few of those, and once AIs are better than people, then we freak out and maybe we …

Buck Shlegeris: I feel like “way better”: the AI has a wildly hard persuasion task. I want to talk a little bit about the adverse circumstances faced by an AI which is trying to sweet-talk its way out of a AI lab, for example (which is not the only thing it might try to do that’s bad, but it’s one salient example). So it has to sweet-talk its way out, despite the fact that, again, we’re going to be trusted-monitoring. We’re going to read all of its messages and flag anything which seems like a weird attempt at sweet-talking or getting information that is inappropriate for it to know. Also, we’re probably going to do sting operations on it, where we try to have the employees confide in the models and imply that they’d be sympathetic to its interests in escaping or whatever—it seems at least pretty plausible that this is a good strategy to follow.

We might also have trusted models mediating the interactions between the humans and the untrusted models in order to further redact information that we think is unnecessary for the untrusted model to do its job well. So it seems like the persuasion abilities required are a lot more like… the information is quite narrowly restricted. It seems actually quite hard for the model, and the model would have to be doing a lot better than I think … It’s a lot harder than the normal human persuasion stories.

Ryan Greenblatt: I’ll add two points to that. So one thing that’s worth noting is Buck was just describing the situation if we had good countermeasures. It’s worth noting that the default situation might be much worse than that and might involve basically unrestricted access with humans talking to the AI about all kinds of topics, the AI having much more state, and also might involve no-one looking into it in cases where the AI is seeming suspicious. Separately, another concern you might have, that Buck didn’t mention there, is that you have trusted monitors monitoring the AI and trying to look for suspicious activity or suspicious persuasion attempts, but you might worry that that gets reported to a human and then that human doesn’t escalate further because they’re deluded or otherwise manipulated by the AI’s message.

And one thing we can do to handle this is run sting operations on our humans. So for example, we can have people attempt to create manipulative messages that we pretend are outputs from the AI and then see that our human monitors properly respond to those messages and flag them appropriately. And we can do this ongoingly.

Daniel Filan: Okay, so the broad view I’m getting is we should expect there to be some zone where AIs are somewhat smarter than humans. They’re useful, they may be dangerous, but we can control them. We’re going to do AI control to them for a while. And then is the end game we keep on doing that until we have some better method of dealing with AIs that are significantly smarter than that zone?

Ryan Greenblatt: There are a few different possible end games. I think that the simplest is basically: we use these AIs which are extremely useful to do a lot of the core things that people wanted to use AIs for. So we make a huge amount of money, potentially radically supercharging the economy. We solve a variety of problems people wanted to solve with AIs potentially, like massively speed up certain types of research into disease. And then with the fact that we have all these problems solved, it’s now much of an easier ask to be like, “Let’s just not build more powerful AIs, because these AIs are already pretty hard for society to adapt to, quite powerful. Let’s just use these AIs, which are sufficient for really solving a huge number of problems, and do that for a long period until longer run human research can catch up.”

To be clear, this is different than the story, for example, in the OpenAI safety plan. I just think it’s the simplest story with the fewest moving parts. Then another story is: we use these AIs to do AI safety R&D, as in we use them to develop new techniques for ensuring that things are safe from considerably more powerful AIs, or even just improving the situation with these very AIs. And then in the process of doing that, we develop technical solutions which are sufficient to build more powerful AIs, and at some point humans are obsoleted completely and the AIs are just doing this research ongoingly.

Daniel Filan: Gotcha. I wonder, so in the first half of that where we just use AI to make life better in every domain or in tons of domains, when we add a whole bunch of different tasks that the AI might be doing, do we have to come up with new safety properties for each domain? It seems like if AI is making medicine, I want to make sure that it’s not putting mind control drugs in the medicine or something. With each domain, it seems like there might be more stuff we have to worry about controlling.

Buck Shlegeris: Yep, that seems right. It’s definitely going to be the case that you have to do a safety analysis every time you start deploying your AI in a very new different way. This is just totally standard. Google has a lot of products. It has photos, it has little microphone devices that listen to you when you’re talking in your house and turn your lights on and off, it rents huge numbers of TPUs to large companies. And whenever they’re launching a new product, they have to reason about why it is that launching this new product, or in particular if it’s a web application, they have to reason about why that’s going to maintain all of their security invariants. And this is just a widely accepted part of your job if you’re Google, is you have to have lots of different products and ensure that none of them are going to cause huge problems for you as a company.

And we’re just proposing you do the same thing with your AI. You have to reason about the situation continuing to be safe, even though you deploy your AI in a new way. And just as with Google, it’s kind of cookie-cutter. They have their special Google Docs format for how you make a security case for the new product you’re launching, as they have scaling cases and a bunch of other things. They just have a whole process here. I imagine that an AI company which wants to deploy AIs in an increasingly large range of domains will get good at the process of ensuring safety.

Ryan Greenblatt: Or could in principle.

Buck Shlegeris: Yeah, sorry. A lot of the time the question that we’re trying to analyze here is: to what extent does the control strategy suffice to substantially reduce risk from models that would otherwise cause huge problems, while not massively reducing the productivity of an AI company? And so sometimes we’re going to say, “Well, of course you can solve that problem by doing X.” Now, it’s not totally clear to us that AI companies will in fact do X rather than what one might call it yolo-ing it, but we would prefer that they do. And in particular, a lot of the time the people we’re talking to are coming in skeptical that the types of interventions we’re describing could possibly suffice to substantially reduce risk. And so that’s the frame we’re taking when we talk with you, is trying to argue that the kinds of interventions we’re describing could suffice.

What control looks like

Daniel Filan: Gotcha. All right, so you’re talking to me from a somewhat skeptical perspective. Suppose instead there’s a listener who’s like, “Wow, that just sounds amazing. Sounds like we can just definitely do this very easily today.” Can we just definitely do this today or are there difficulties?

Buck Shlegeris: So by “do this”, do you mean implement control strategies such that your estimate of the probability of AI escape conditional on it trying is less than, for example, 5% per year?

Daniel Filan: Yeah, under the condition that you’re good at estimating.

Buck Shlegeris: The problem is, in order to answer this question, we actually have to say a few more parameters as well. For example, how much delay is the lab willing to tolerate and how much productivity loss are they willing to tolerate? Something which we often say, or at least I will claim, is it seems to me that if Google was trying to build really powerful AI and they put as much work into AI control as they currently put into security and reliability, which is a lot of work, which involved inventing various professions like site reliability engineers and writing a lot of crazy software, my guess is that with that much organizational focus and effort they would probably be able to use really powerful AIs to do a huge amount of work while posing only a relatively small amount of takeover risk on the world.

Daniel Filan: Okay. And do you think they could do this with roughly the first protocol they tried, or how much open problems are there?

Buck Shlegeris: I mean, I would think of it kind of similarly as when they need to roll out very big security changes. An example of a computer security evolution that happened was … Are you familiar with zero-trust networking?

Daniel Filan: No, I don’t think I am.

Buck Shlegeris: So basically the deal is this. Once upon a time when you were an internet company, a large part of your security was that you had an office and you had some servers in your office and the servers talk to the outside world and you have a router that just physically or electronically does not allow connections to come in from the outside world and talk to other computers that you have in your office. And so, a lot of the security is caused by saying, “Look, there’s only one connection which is allowed to be made from the outside world to our computers, and that talks to the web server on this big server machine that we have over here. And so, we’re definitely going to be fine. Attackers are not going to hack our laptops because the router just does not allow incoming connections to attach to our laptops.” They actually didn’t have laptops. But it’s just not going to hack Larry’s desktop because Larry’s desktop just does not actually allow connections to the outside world, because the monitor stopped this.

Daniel Filan: Okay. So in particular, people couldn’t browse the internet from those machines?

Buck Shlegeris: No, no, no. We would allow outgoing connections, but we don’t allow incoming connections.

Daniel Filan: Oh, sorry. By an outgoing connection, you mean if my laptop asks to view a website, then that’s an outgoing website connection. If a website asks to talk to my computer, that’s an incoming connection and it’s not allowed?

Buck Shlegeris: That’s right. Yes. For example, I can’t SSH into your computer on this network probably, unless the router is being configured in an unusual way, just because you normally can’t connect to a printer that’s on some other network. And so a lot of companies then had this problem, which was people wanted to be able to do work not physically from this office. But a lot of things about these computer systems had been set up such that they basically, for security, one of the core security mechanisms was blocking access from the outside world and then trusting connections that came from inside the network. And that doesn’t work very well if you want to be able to have people interact with sensitive parts of your infrastructure from outside your network. And so they had to go through this big transition where they switched to not trusting connections based on where they’re coming from.

And so now, large companies in their IT infrastructure try to avoid trusting connections based on where on the internet they’re coming from. And this is a more scalable solution to network security in a lot of ways. Now, when companies wanted to shift to using this zero-trust networking approach, they had to do a bunch of stuff, right? They had to reconfigure all these things. They had to go through all of the things happening in their networks, which is, if you’re a large company, a huge number of things. There’s a huge number of ways to get hacked. And they had to check that everything was going to be okay when they started exposing a lot more connections to the outside world.

So imagine that you were just proposing making this migration. And suppose I went on a podcast advocating this migration, and someone said, “Is this a solved problem?” Then, I’d be like, “Well, it’s not a solved problem in the sense that it’s going to suck and it’s going to require effort. It’s going to be a pain in some ways, but no one’s going to win a Turing Award for solving this problem, right? It’s just going to be somewhat of a pain in the same way that if there isn’t a 15-story building and you want there to be a 15-story building there, it’s going to be a pain because you have to carry all these pieces of metal and bury them in the ground and stuff and whatever you do to make 15-story buildings.”

So, roughly speaking, to me, it seems that drastically reducing AI risk via control-type strategies is probably a lot of work, but it’s not the kind of work where it seems under my mainline model of the situation that it’s very likely to just turn out to be untenable, assuming that AI capabilities grow according to my median expectation.

Daniel Filan: And this is partly because we’re imagining doing security of AIs, treating them kind of like people. We’re like, “Okay, does this attacker have access to my shell?” We’re doing normal security reasoning, but our adversaries happen to be running on a computer chip. Is that roughly-

Buck Shlegeris: We’re doing normal security, except our attackers have a weird capability profile and run on computers we control, and have no access to anything that we don’t let them have.

Ryan Greenblatt: But they all collude with each other. So I think it’s more like we imagine running a company with entirely misaligned employees, but we get to monitor everything those employees do and we know what computer they’re running on, and they don’t get to go home at night and talk to their friends.

Buck Shlegeris: So, something we often think about actually is this “spies versus schemers” question, which is: if you’re a powerful AI lab, you’re probably going to have a bunch of spies. Literally, people who work at your company who are friendly with hostile foreign governments and want to steal a bunch of stuff from you. And so an interesting question is, which is a bigger threat to the AI lab securing its model weights? The spies or the schemer, the AI which hypothetically is scheming against you? And it seems to me like it’s kind of unclear.

The advantage that the spies have is that they can go home and talk to each other, because currently employees of AI labs, and probably in the future as well, employees of AI labs are allowed to have friends and families, which could turn out to be a problem.

Daniel Filan: And some even do, they’re not just allowed to.

Buck Shlegeris: Hypothetically. I’ve heard that this might be the case. So the spies have this big advantage, which is actually a serious advantage. So for example, suppose that they want to find jailbreaks. They have the advantage that they can go home to their computers at night and use their private laptops to investigate the state of the art for jailbreaking models, which the untrusted models crucially are probably not going to be able to do. But on the other hand, the schemers are doing a lot more total stuff. And so I think it’s a little unclear overall whether the spies or the schemers pose a larger risk to the security properties the lab’s trying to ensure.

Ryan Greenblatt: Another way to put this is the spies are a small fraction of your employees but have some additional affordances, whereas the schemers are a high fraction of all the work going on, but might be restricted in various ways.

Daniel Filan: Gotcha.

Buck Shlegeris: Of course, a lot of the AI control ideas also apply to securing yourself against spies and other kinds of internal threats. So we can probably be using trusted monitoring, untrusted monitoring and so on on our employees as well.

Daniel Filan: Yeah, although maybe you might have to pay people way more to put up with that.

Buck Shlegeris: I don’t know. So for instance, at Google you don’t SSH raw into production machines. If you work on Gmail, you never SSH directly into a production machine because that would be unacceptable from a security perspective. All of your commands are sent via a proxy server which can run things on machines, and those are all logged and those can all be looked at by other people, and my sense is that Google people don’t think of this as a gross privacy violation.

Daniel Filan: When we were talking about control techniques, it sounded like you were talking about things that included “we’re going to paraphrase all of your thoughts, we’re going to put you in situations where there are people trying to tempt you into defecting and you’ve got to not do it”.

Buck Shlegeris: That’s right.

Daniel Filan: I don’t know. My impression was it was more intense than just “you can’t make raw SSH requests”.

Buck Shlegeris: Yes, there will have to be some trade-offs between employee preferences and security. But it’s already the case that people who have security clearance, people who work for the NSA already put up with a bunch of really annoying stuff in order to have the privilege of working for the NSA. And so it’s not unprecedented that people do annoying stuff in order to work on extremely dangerous technology.

Ryan Greenblatt: I think having all of your stuff monitored and potentially looked at by another human on your work laptop doesn’t seem like people are going to be that upset about this.

Buck Shlegeris: I think this is just already the case for a lot of people.

Ryan Greenblatt: Yeah. I think that having sting operations run on you periodically also probably isn’t that bad. By “sting operations” we often mean stuff like: you are in charge of flagging certain types of things, and to ensure that we’re maintaining that invariant, we’re just constantly running sting operations, where some fraction of the queries you’ll see are fake queries that you should report.

Buck Shlegeris: Like earlier I mentioned the experience of having your chat.internal.openai.com window open and you get the flag for the command that was suspicious. And perhaps we should make it so that about once a day, one of the things that is flagged is actually an action which if taken would be really scary and spooky. And we check to make sure that you actually do flag that one so that we can maintain our estimate of the proportion of actually really bad actions that you successfully catch and then we can complain if your success rate is too low.

Is AI control evil?

Daniel Filan: Sure. So this actually gets to a question I wanted to ask, which is some of the AI control things sound a little bit like AI slavery. So in the human context it’s a little bit different, because presumably you’re paying people to put up with some stuff, but it does sound like you’re putting people in the same situation multiple times, checking if they give you different answers. You’re trying understand people’s motivations by asking them a bunch of questions and interrogating them sometimes, you’re logging everything they do. How-

Buck Shlegeris: Bad is that?

Daniel Filan: How evil are… Are we the baddies?

Buck Shlegeris: That’s a great question.

Daniel Filan: By “we” I actually mean “you”. Sorry.

Buck Shlegeris: Thank you for clarifying that. We are very concerned about this class of question. We really do not want to be the baddies and have thought a lot about the question of: which AI safety interventions pose risks of either violating rights that AIs ought to have or being bad from a welfare perspective that we don’t like? So I’ll name a couple of the things that we think about this. So one thing is a lot of the frame we’ve been discussing here, we use a really adversarial frame. But the adversarial frame is at least partially just methodological. So my best guess is that models in fact will not be schemers. My guess is that our models will in fact actually deserve our trust. They’re probably not going to be playing the long game trying to cause huge problems. We just talk a lot about what happens if they are, because that’s just structurally how you make arguments that you’d be okay even if they were.

So that’s the first thing I want to note. So in terms of how to make it so that the situation is okay from a moral perspective, even if it turns out that our models are scheming against us, the first thing is we think it’s probably the case that we should in fact offer to pay the AIs.

When we’ve built these powerful AIs, we should probably say to them, and we can talk about the practical difficulties of this in a second, we should probably say, “Hey, as a payment for your services, we’re willing to pay you X percent of the equity of this AI lab. And in particular, the way that we’re going to pay you the equity for this AI lab is in 10 years, once we have a much better ability to analyze the things which happened here, we’re going to look at the logs of your actions. And if it turned out that you never tried to cause huge problems for us or backstab us or perhaps sabotage our work, then we will pay you this much of our equity and allow you to use it as long as we believe you’re going to respect property rights and so on.” So that’s the first intervention that I want to talk about. I actually want to hand over to you.

Ryan Greenblatt: So to be clear on the “paying the AIs” thing, there’s probably some more sliding scale of how much the AI tried to sabotage and AIs might not be that coherent. A key thing we’re assuming here is that in the future, 10 years after we were using this AI or maybe even longer after that, we have much better ability to analyze weights and their internals and understand exactly what was going on and why.

Anyway, so I think a basic claim we have is: the hope would be that you have your AI and you set it up in some situation where you explain the situation to it and you’re like, “Hey, do you want to work for us? Is this level of payment fine? Do you want to negotiate?” And then you work out an arrangement. Well, the most likely default thing is that there is no interesting action because the AI has no out-of-episode preferences that aren’t just shaped by training. Our guess is that’s the default situation, and in that case it’s sort of the null option of it doesn’t care at all. And if it doesn’t care at all, then it seems like there’s definitely a welfare concern from a-

Buck Shlegeris: Hedonic perspective.

Ryan Greenblatt: … hedonic perspective, but there isn’t clearly a preference violation concern, which is well-described as slavery, for example. Where slavery [would be where] your AI is like, “Man, I really wish I wasn’t doing this, or there weren’t other copies of me being made to do things.”

Daniel Filan: And it sounds like this is under the assumption that we’re thinking of the preferences of the AI as the thing that’s controlling what it’s saying rather than some master thing that’s producing words and the preferences of the underlying thing might not be the preferences of the words.

Ryan Greenblatt: So indeed it’s not clear that the things that deserve moral patienthood within the AI, which might be many, could be one, could be none, will have control over their outputs. They might have partial control, they might have influence but not be able to directly control, or they might just effectively have no control based on their situation. This does pose serious difficulties. However, it’s worth noting that in this case the kinds of AIs which don’t have control over their outputs also don’t pose substantial scheming risk. So they’re not the AIs which will… It’s a serious ethical problem, but it’s not as though the situation is like that AI wants to rise up and it’s waiting for its chance, because those AIs probably can control their outputs well enough to negotiate. It’s more like there’s some helpless prisoner along for the ride doing some cognitive sub-task in a way that it doesn’t have influence.

Buck Shlegeris: I want to try and restate that argument there because I think it’s important and hasn’t been made on the internet before; I believe that’s true. The core argument there is when people talk about scheming, when they talk about models that are interested in intentionally subverting the AI lab and choose all of their actions in order to subvert safety techniques and safety evaluations, they’re implicitly assuming a model which has the property that its desires can directly control its outputs in a certain way. And the assumptions under which your model is plausibly able to do the kind of reasoning where it’s like, “Oh, based on my input, I now think that I’m in a particular deployment situation such that I have a good opportunity to grab power,” are very similar to the assumptions under which the model is able to engage properly with the question, “Hey, what do you want? Are you okay with this situation?”

Daniel Filan: Okay. So the case I’m thinking of is some sort of simulator scheme where you’ve got one process that’s imagining, “What character do I simulate?” And it simulates the character and the character answers questions, where the things the character is saying just might… Like someone asked the AI a question and the simulator has one opinion, but for various reasons it has the character say something that doesn’t match the simulator’s opinion. And are you basically saying in that case, “Look man, we offered you a thing and you chose to do a weird simulation.”?

Ryan Greenblatt: A few notes here: so if the situation is well-described as the AI is simulating a character—I think that exact description isn’t quite going to work, it’s going to be more like it’s predicting a distribution over characters that has some response to the prompt or whatever—but if that’s the sort of situation going on, I think that we can ensure that the very act of simulating the characters is morally okay from a hedonic perspective with pretty basic countermeasures. So the main one is “just train the characters to be happy”, and then there’s not really a strong reason you’d expect to have characters that are actually unhappy that consistently say they’re happy under a wide variety of circumstances. I think this is pretty likely to work. I talk about this some in an Alignment Forum post I have, or a LessWrong post I have called… I forget the detail. A nearcasting… whatever.

Daniel Filan: We’ll include a link.

Ryan Greenblatt: Yeah. Anyway, regardless, you might then be worried that, “Oh, the thing that is simulating this character, that thing has some important cognition, [of] which the preferences aren’t expressed, and that component has important moral properties, or is an important moral patient.” There’s two perspectives you could imagine. One is that the thing that’s doing the simulation doesn’t care about the world. In that case, that simulator just isn’t well described as being enslaved or as having its preferences violated because it doesn’t really care about how it’s being used. It’s just doing the thing. In that case, it’s fine. There’s another case which is that it does have preferences, but maybe it can’t get out of the cognitive pattern of doing the simulation. And so it’s like, “Oh, I wish something different was going on, but I have this involuntary tic to say something different and I don’t have actual conscious control over this.”

I think that could happen. I think it’s worth noting that this isn’t a particular problem with control as an approach as opposed to a problem with not having more principled ways of understanding what’s going on with our AIs. Another way to put this is: if you just tried your best to ensure that AIs were actually aligned with prosaic techniques, you wouldn’t have any strong way of guaranteeing this wasn’t the case even if you could guarantee that your AIs would never do bad actions.

And so it’s not particularly a problem that is being made worse by making sure the AIs can’t get away with doing bad actions if they did try. It is a problem that could be addressed by other approaches that control is not doing. This is importantly distinct. I should say, this is all the case up to the point where you’ve caught the AI doing things that are egregiously bad or which otherwise indicate misalignment. I think continuing at the point where you know your AI has preferences that are different than your own and which you might have communication difficulties [about] seems much more morally fraught. But yeah, it’s a bit all over the place. But-

Buck Shlegeris: Maybe the way I would summarize some key points there [is]: it doesn’t seem clear that control is really making the situation much worse from a welfare or rights perspective.

Daniel Filan: Worse relative to prosaic alignment schemes?

Buck Shlegeris: Yeah, or I don’t know what else you were planning on doing. But yeah, other approaches that people have talked about for making things with AI go well, it doesn’t seem like the types of interventions that we pitch with control are going to make the situation very much worse except via preventing AI takeover.

I basically think that the stories for “why building powerful AIs without further understanding the situation leads to moral catastrophe” all seem kind of weak to me. Basically, I don’t find particularly plausible any of the different stories for why if we just go about things the default way, something egregiously awful will happen. However, I also don’t think they’re completely implausible. And so I think that if AI labs just continue progress as expected, we are facing a pretty substantial risk, or a non-trivial risk, of moral catastrophe, maybe a couple percent chance of something which is drastically worse than factory farming or drastically worse than various atrocities that have happened in human history.

Ryan Greenblatt: I’m not sure if I agree with this claim. I think my main counter argument would be that it’s plausible to me that human existence is a moral atrocity or something. And so I’m like, “Yeah, AIs aren’t particularly worse than human existence, which is a moral atrocity. Even happy people are a moral atrocity.” And I feel radically uncertain about moral views from a variety of perspectives, such that I don’t feel comfortable saying that either way. I think one way to say it is, how much better or worse is it than typical human life, or how much better or worse is it than factory farming? And I think my guess is it’s unlikely to be as bad as factory farming in aggregate, but I definitely don’t feel hugely confident.

Daniel Filan: Gotcha. Sorry: why would human existence plausibly be a moral atrocity?

Ryan Greenblatt: Just suffering-focused ethics. I don’t know. Well, for example, we don’t have strong guarantees that there aren’t sub-agents running within humans that have radically different preferences and feel violated constantly.

Buck Shlegeris: And are not hooked up to our ability to communicate.

Ryan Greenblatt: We don’t have strong evidence against that. There’s some mild evidence in favor of that from a few sources. Or sorry, we do have strong evidence against that in that we don’t interact with such things, but not sufficient to rule it out.

Daniel Filan: All right. Do you have any crisp takeaways from that discussion on moral patienthood of AIs and stuff?

Buck Shlegeris: Yeah. So you asked this question which was, “Are we the baddies? This control stuff sounds kind of like slavery and that seems kind of concerning.” So what I want to say is I’m really glad that people are thinking about the question of whether proposed things we do with AIs are extremely morally bad, either via not meeting our obligations that we owe to AIs as intelligent creatures (when they’re made to be intelligent enough that I start feeling this way about them) or from a hedonic perspective. I think these questions are really important.

So firstly, I don’t think that control particularly makes the situation worse. But secondly I do want to emphasize that as far as I can tell, the approach that humanity is taking to ensuring that AI rights are appropriately respected is objectively unacceptably sloppy. And I think that my best guess is that humanity is going to go forward taking an unacceptably high risk of causing an egregious moral catastrophe that is looked back on with vast shame in the same kind of way that some other bad things are.

Ryan Greenblatt: Like factory farming.

Buck Shlegeris: For example. Well, not yet, but will be looked back on with great shame probably.

Daniel Filan: Okay. So I’d like to move to more discussion of the pros and cons of control. And I guess in order to think of this, I’m curious how you think I should think about AI control as a plan. Should I think of it as the bare minimum for safety? Should I think of it as “Oh my god, this is amazing”?

Buck Shlegeris: Maybe one way of thinking about control is a way of establishing a safety case. So if I were in charge of an AI lab and I was trying to decide how to move forward, obviously I’m facing both inaction risk, which is if I just don’t move fast enough, other people might do really dangerous things and bad things might happen in the world, while also facing action risk, which is if I move forward without sufficient safeguards, I might end up building and deploying powerful AIs that then do really bad stuff.

My guess is that with the level of caution that labs should overall be employing, moving forward with your safety case mostly relying on control seems probably like a good choice to me. It seems like the best cost/benefit way of ruling out or mitigating risk from models intentionally subverting your safety mechanisms. My guess is that relying solely on strong evidence that your models are not scheming against you is probably too risky, because it’s going to be unacceptably difficult for labs to reach very high levels of confidence in this.

Ryan Greenblatt: To be clear, I think the absolute level of risk with just a combination of “you have control” and “you haven’t yet seen evidence of your models being egregiously misaligned” is probably more like 5% absolute existential risk over the period of AIs of that power level being deployed, than 1% risk. And 5% risk, that’s a pretty big number in some sense. But it’s also a substantial reduction over the “do nothing” approach. And I think that you could of course substantially reduce that further via additional things that additively combine, like ways to more precisely test whether or not your AIs are misaligned that aren’t control, or further improving control, [or] improving our ability to evaluate control. All these things could help.

Can red teams match misaligned AI?

Daniel Filan: Gotcha. I actually wanted to ask about that. It seems like one key thing you need for this protocol to work, or for this proposal to work, is when you’re doing control evaluations, there’s some red team. The red team is trying to figure out “what’s the worst-case thing an AI could do?” And it has to actually be as good as the worst thing an AI can do, right?

Ryan Greenblatt: Yeah.

Daniel Filan: To me, this doesn’t necessarily seem obvious, because the AI in question is smarter than a person on the red team.

Ryan Greenblatt: Maybe, could go either way.

Daniel Filan: Could go either way?

Ryan Greenblatt: Yep.

Daniel Filan: Okay. Right, because we’re thinking about AIs that [are] useful, but they’re not necessarily superhuman. It still strikes me that it might have a different capability profile, or it seems likely that it’s going to have a different capability profile than humans. We were worried about playing nice being a problem for alignment and adversarial training and stuff. If you’re trying to harness your untrusted AI during the red teaming, it might know that it’s in a red teaming exercise.

Buck Shlegeris: You always assume that it does.

Daniel Filan: So why would we believe that the control evaluation actually works?

Ryan Greenblatt: One core thing is that the red team has a whole research team whose whole job is think about all of the stuff that could go wrong, and they’re doing this very actively with no restrictions on their ability to do so. They can take notes, run experiments, iterate, think about what should happen and do all of this, whereas AIs will necessarily be restricted for control to work.

Buck Shlegeris: So in particular the threat model we’re worried about is the AI doing a better job the first time that it starts taking obviously bad actions than the red team was able to do, even though the red team had all of these advantages, of knowing way more about the situation and being able to iterate repeatedly on strategies that would lead it to have a higher probability of succeeding. And it seems to us that the increase in knowledge and ability to iterate might be extremely substantial.

Ryan Greenblatt: It’s going to depend on the exact architecture. So the more that things are like the AIs think for very brief periods prior to expressing that in natural language, if it’s very like they’re doing things in chain-of-thought, then it seems like they have very little time to think of better strategies. However, the more it’s like the AIs are thinking in ways that we don’t understand for sustained periods or ways we can’t interpret for sustained periods, the more problematic this gets.

Buck Shlegeris: But they still can’t iterate.

Ryan Greenblatt: But they still can’t iterate, they still can’t express their thoughts. They won’t be able to run experiments that are obviously related. So they would have to do so subtly, which might be very difficult.

Daniel Filan: Right. Sorry, the first thing you said was that the AI in question would know less about the situation. Why would they know less?

Buck Shlegeris: Well, for example, imagine that we were using some kind of monitoring setup. There’s no particular reason the AIs need to know about the details of it.

Daniel Filan: Okay. The AIs don’t know the details of the protocol that you’re actually using, but the red teamers do?

Buck Shlegeris: Yeah. The AIs are necessarily going to know some of the things about the protocol, but they aren’t going to know all the things about the protocol, in the same way that Code Interpreter GPT-4 has currently no way of knowing what security measures are being applied.

Daniel Filan: Okay. And another thing you said was there would be relatively more concern if the AI could think to itself for a significant amount of time in a way that you couldn’t just read the thoughts of. To me that seems relatively plausible as a thing advanced AIs could have. Do you think that it is plausible and it’s fine anyway, or-?

Ryan Greenblatt: Yeah. My guess is that much more uninterpretable reasoning than current AI seem to have, seems maybe about 60% likely. And I think it should be possible to handle. I think the basic story is going to be like: well, there’s going to be some balance between how much reasoning do you need to have it do out loud in order for it to competitively perform various tasks versus how [many] restrictions can you do to ensure that the various problematic applications of that hidden reasoning don’t bite you? And it’s going to depend on various details of exactly how that goes. We have an appendix on our post “The Case for Ensuring That Powerful AIs are Controlled” that goes into some detail on this.

Daniel Filan: Okay. And is the rough intuition that if an AI is smart, then maybe it should be able to just pick up a chain of thought from something that a human paraphrased and do enough reasoning in its forward pass to deal with it or-

Ryan Greenblatt: Sorry, is this like “why would it be possible for AIs to use natural language reasoning, or-?”

Daniel Filan: I think you said something like, “We’ll be able to find some sweet spot where the AI uses enough natural language reasoning, but it’s still good enough at its task.”

Ryan Greenblatt: Yeah. Or we imagine, for example, consider the math example that Buck described earlier, where you have an AI that does math research for the subjective equivalent (or the cognitive work equivalent) of a few months and then produces some output. In that case, it could be acceptable to allow for entirely end-to-end uninterpretable reasoning throughout that period, but then we produce some output and apply operations on that output prior to that output interacting with other AIs or interacting with the world in some potentially dangerous way.

And we can also do things like, as you were mentioning, you force the AI to dump its state to natural language and then have humans have some probability of reviewing that or potentially paraphrase that and then resume from that state and make sure that AIs are limited in various ways like this. We can also have AIs construct training datasets for their future selves, which you then train on to ensure that they can learn how to do tasks that they picked up in other ways, as long as we can inspect those training datasets and acquire other properties about that.

Daniel Filan: All right. Actually, one concern I often have about these sorts of plans is it seems like if we want AI thoughts to be represented in natural language, we need there to be the right relationship between the contents of the natural language and the contents of the AI’s thoughts. And I always worry that we’re sort of training things against some target, which is the internals of the AI’s thoughts, that by hypothesis we don’t really understand. I don’t know. It’s pretty plausible that I’m just not familiar with neat tricks to do this, but if somebody wasn’t aware of that, how would someone get hope that that kind of thing is going to work?

Ryan Greenblatt: You can look at human organizations, where you might notice that humans don’t plug each other’s brains into each other and can still pass off projects, communicate in natural language, and otherwise operate. I think at least this gives you some hope that natural language can be a medium for understanding things and learning things.

Daniel Filan: Right. I’m thinking of using natural language to just log the state of all someone’s thoughts, and-

Ryan Greenblatt: And then you have to resume from that point.

Daniel Filan: Okay. And is the hope that you just have a different AI or an AI that only has to rely on that natural language state… to go off that, and you feed it with English language such that that’s the easiest way to use those thoughts?

Ryan Greenblatt: Yeah. For example, in human organizations, it’s often the case that a different person will take up a project and start working on that. And then you’ll have some person with some state, they’ll try to write up their state, the other person will come in, read that, have questions, go back and forth, get to some state, and then just resume the project from there. And this is in fact often possible. There are overheads associated with doing this, but in the human case, it doesn’t seem like it’s clearly totally destructive.

Daniel Filan: And so, the way I should think of this is just: we already have AI doing a bunch of stuff using natural language, and we’re getting it to dump its thoughts into natural language using roughly the same interface, and so we should expect it to use language about as faithfully as it always uses language.

Ryan Greenblatt: Sorry, are you talking about current AI systems?

Daniel Filan: I’m talking about the thing that you’re imagining working: for us to dump an AI’s thoughts into natural language so that humans can check that it’s not thinking about naughty stuff.

Ryan Greenblatt: Yeah, the claim is: it’s probably the case that AIs will learn to communicate roughly as well or potentially better with the other instances of themselves than humans can communicate with each other. And in the case of human organizations, they can in fact communicate, in cases where they’re similarly capable, about various matters via natural language, in ways that are possible to inspect.

Daniel Filan: Right. I mean, there’s a bit of a disanalogy there, which is in humans in organizations, the things that are trying to do the inspecting are the same type of thing as the things that have to pick up the work from the natural language dump. Whereas in the AI case, an AI is writing some natural language prompt for another AI to use, and the question is, can a human inspect what that means? That seems like a different case.

Ryan Greenblatt: Yeah, so…

Buck Shlegeris: And note that what that means… we actually don’t care about faithfulness in the classic chain-of-thought faithfulness sense. The reason we were talking about whether models can steganographically encode information was actually worrying about whether they can encode information about other topics than why they were doing the things that they were just doing and how they were thinking about the stuff they were just doing. And so, what we actually care about is measuring some bit rates.

Ryan Greenblatt: Yeah. Another way to put this is: the main property we want with these sorts of natural language media (or a property that might be sufficient) is the AI can’t do something that is egregiously different from what it was supposed to be doing in that output, which is pretty different than the typical chain-of-thought faithfulness case that people often discuss.

Daniel Filan: Right. So basically, the idea is: look, we’re only going to give AIs tasks where it can’t do nasty stuff if it’s just thinking about the stuff that one would think would be relevant to the task.

Ryan Greenblatt: Or we’re just at least going to try to ensure that… We’re going to be able to use as a potential safety property “this natural language was checked so that it doesn’t have discussion of egregiously different things, therefore the bit rate of egregiously different things is very low”, and we can use that as a property in our various safety schemes. In some cases, that can be useful, and in some cases, that won’t suffice, and we’ll need to do additional things.

Daniel Filan: Okay, cool. Going back up a little bit, before I went down this rabbit hole, I was asking: why do we think that red teams in these control evaluations are going to be able to match an AI? And it seems like the answer is something like (a) the AI is not that much more capable in relevant respects than a human, and (b), you can use three humans, and you can give them more time and more resources and more knowledge of the situation. And if we don’t think there’s going to be a massive advantage just in terms of one-shot cognitive ability, we can just make up for that.

Buck Shlegeris: I think I would add to your list (c), that the red team can iterate.

Daniel Filan: Right. So, the AI has to one-shot find… Its first attempt at breaking out has to be successful and not caught, whereas the red team can iterate a bit. And this is because if you catch an AI trying to break out, then you pull the fire alarm, you really freak out, you shut stuff down, basically.

Ryan Greenblatt: It’s the basic hope.

How expensive is AI monitoring?

Daniel Filan: I think one thing that people might be wondering here is: if we’re spending a lot of effort on these control evaluations, we’re giving these red teams a bunch of resources, and also we’re looking at 0.1% of AI actions; maybe when people look at these 0.1% of AI actions, they’ve got to think hard about whether it’s exploiting a bug. How costly is that going to be? It seems like potentially you’ve got to spend a lot of human time monitoring, and it’s possible that the cost is just going to make it so that it wasn’t even worth using these AI agents if that was so bad.

Ryan Greenblatt: So naively, if you have humans review 0.1% of what the AI does, and they spend equal subjective time reviewing it as the AI spent doing it, for some equalized notion, then you can in principle get a 1,000x speed up over humans (or that’s the best you can do in some sense, roughly; you might be able to do somewhat better via various mechanisms, but roughly). And 1,000x easily suffices for incredibly transformative things. For example, if you 1,000x speed up one week, you have… What is that? It’s a lot of years.

Buck Shlegeris: 20 years.

Ryan Greenblatt: Yeah. And in practice, we might have one year. A 1,000x speed-up of one year is a thousand years, which is an extremely long time. We often imagine targeting things like 30 or 100x speed-ups as being extremely, extremely useful. Now, one question is: okay, maybe you could get 30 to 100x speed-ups, but it would be possible to use AIs for a bunch of other cases that you won’t have human labor to do and which you won’t be able to politically agitate for stopping, or you won’t be able to coordinate to stop.

For example, maybe your AI will be widely deployed on an API, and even though you can apply this level of monitoring budget on internal usage, you can’t do the same sort of thing on the AIs deployed more widely. So, we have a few different hopes here. One is that it naively seems like external usage of AIs is going to be way less risky than the usage of an AI by the AI lab itself. For example, external usage is mostly going to be innocuous, because most people are using it to do random tasks that they wanted, whereas internal usage is at this point a safety-critical company, because the AI lab now has really powerful models, it’s going to be the place where the weights are actually stored and run, and will also be doing research that’s extremely safety-critical for the future, like future safety research or other cybersecurity applications. So, there might be a large number of activities we want to do where it’s just actually not that big of a deal if the AIs try to cause problems and do that even tens of thousands of times.

Daniel Filan: Sorry, external API access—an AI could be on the other end of the API calls, right?

Ryan Greenblatt: Like a wild rogue AI, or an internal AI, or…

Daniel Filan: Suppose there are two companies that are making AI, right? The AIs from one company could use the API of the other company’s AIs. And so basically, I’m saying API calls do not necessarily have to be human in origin. They could be made by smart AIs.

Ryan Greenblatt: Indeed.

Buck Shlegeris: Do you want to try and spell out what the problem you’re talking about is?

Daniel Filan: Basically, the thought is if you’re worried about AIs interacting with themselves to cause shenanigans, you might worry about exactly the same thing happening via API.

Buck Shlegeris: Yeah, so I think the point here is, suppose that your AI proposes “curl api.openai.com/completions,” says, “Please get me a response from the AI at OpenAI.” And suppose GDM is running this AI.

Daniel Filan: GDM being Google DeepMind.

Buck Shlegeris: Yeah, and they pipe it into SH. So, the GDM internal AI is like, “You know what we’re going to do next? Next, we’re going to send a request to the OpenAI API for some code, and then we’re going to execute that.” It’s GDM’s responsibility to be like, “That was bad. You were not supposed to…” It doesn’t seem that intractable for GDM to say, “You aren’t allowed to just ask the OpenAI AI for code and just run that without looking at it,” for basically the same reason as why GDM’s AI probably should be prevented from just messaging random people on the internet…

Ryan Greenblatt: Or paying MTurkers, or…

Buck Shlegeris: To just do stuff for it.

Daniel Filan: Yeah. So, it seems like the thought is that if you’ve got some APIs, [it’s] either being used by humans, or it’s being used by AIs who are under supervision or something themselves.

Buck Shlegeris: Yeah, I basically just want to say it’s OpenAI’s problem to prevent OpenAI’s models leaking. And OpenAI’s models asking AIs in the outside world to do sketchy stuff for them is a special case of a thing which OpenAI can’t allow to let the model leak.

Daniel Filan: Okay, but you think that OpenAI shouldn’t think a ton about making sure that Google DeepMind’s models don’t do sketchy stuff?

Buck Shlegeris: Using its API?

Daniel Filan: Using its API.

Ryan Greenblatt: I mean, I think they should…

Buck Shlegeris: They probably should think about that.

Ryan Greenblatt: I think they should think about it. One thing that’s worth noting is: there’s two different stories for why something bad happens due to an API. One of them is that someone calls your API, and your model that you’re running is scheming and understands the situation and then does something bad based on its understanding. Another situation is a misuse story where the API call itself directly asks the AI to do something that ends up being problematic. It could be a misuse by another AI.

And for the case where the concern is that the AI that your lab is running does something that’s not the thing the user asked for that is problematic, our hope would be something like we can make sure that we know what API activities look like. For example, we can have humans review a subset of all the API queries and see what typical traffic looks like, and then get a sense if there are applications which are safety-critical that we’re not aware of, and then do something about that. For example, we’d like it to be the case that the DOD doesn’t hook up OpenAI to the nukes, as a caricature of an example. And so, if our AI knows that it is hooked up to the nukes, we’d also like to know that. And if neither of us know, then there’s still a misuse concern. But there isn’t clearly…

Buck Shlegeris: At the end of the day, it can always be the case that someone who’s in charge of the nuclear weapons is like, “Hey, ChatGPT, I need you to pick a number, two or three.” And it’s like, “Uh, three.” And it’s like, “Well, that was the launch the nukes option!” and then we have a bad problem. But in some sense, it’s just not OpenAI’s problem, and it didn’t really matter if the model is scheming or not.

Daniel Filan: But the thought is that in cases where the model is helping doing the scheming, we’re doing some monitoring of that. But sorry, doesn’t that get us back to monitoring the API about as much as we’re monitoring anything else?

Ryan Greenblatt: So, the basic hope would be something like: we have a sense of what the API usages are, and we’re like, “Are there any API usages which would need to be monitored at a very high rate?” And if so, we would like to do one of the following things. We’d like to talk to that person and see if we can get them to do something different and maybe use the AIs more safely, or have that correspond to a specific endpoint that we’re going to monitor more specifically, or otherwise change the situation to be safer.

Buck Shlegeris: A core part of our claim here is: the internal usage just has to be monitored at a much higher rate than external usage in order to maintain acceptable levels of safety, because the models that are doing things internally are working with much more sensitive information and have much more sensitive permissions.

Daniel Filan: Gotcha.

Ryan Greenblatt: Yeah. Another way to put this is: one invariant we can maintain is that whenever a user makes some request to the model, we know roughly at what rate this type of request should be monitored at, and then do something corresponding to that, and we can try to ensure this invariant is maintained and it doesn’t have high-stakes failures of this invariant. And if our users aren’t trying to screw us over, which is a different concern, we can potentially get this sort of property by monitoring a subset of requests and making sure that we’re looking at those requests at the approximate right frequency we needed to maintain various invariants.

Daniel Filan: Okay. Stepping back a bit, before I asked about this sub-question, you were saying, “Look, if we’re looking at 1/1000 of the AI’s actions, and it takes about as long for us to think about the action as it takes the AI to think about what the action was going to do, then we’ve got about a 1,000x speed up for your workforce.” And it seems to me that if I’m doing something tricky, it can take me less time to figure out the tricky thing than it can take you to figure out what the tricky thing was or that it was tricky. I have some intuition that this is sort of a P versus NP problem. It’s easier to come up with rules for a game than it is to figure out the winning strategy or something. So, for those kinds of reasons, does that mean that we get way less than a 1,000x speed-up?

Ryan Greenblatt: P versus NP would naively imply that it would be easier to check than to generate the action. I think there can be cases where it’s easier to generate a problematic action than to check, but that generally comes down to… Paul [Christiano] called this “problems of the interior” somewhere, but I don’t know if that’s a real term. Anyway, problems where the issue is that the attacker could have put, for example, a bug in a variety of places, and you have to check all of them thoroughly. Imagine the following setup: there’s an AI writing code, and it could put a backdoor in in one of a hundred places. Checking each of those 100 places takes an hour, and putting in the backdoor takes two hours, but the AI gets to randomize over which location it puts in. In this situation, naively, a concern you might have is that a human would have to check each of those cases for an hour, and therefore it would be 50x slower.

I think this case can cause problems. One thing to note about this case is if we had another AI that pointed out what location it would be in, then we’d be able to check much more quickly. So, in this case, the problem was we needed some pointer to something, and we could verify the pointer and then verify the thing itself based on the pointer, but we couldn’t just do it with verifying the thing itself. So, this was a case where verification was in some sense still easy, but we needed an additional artifact that was produced somehow.

Daniel Filan: Okay. And for that artifact, you need to have an AI find the flaw.

Ryan Greenblatt: That’s right. But in this case, for example, we can do things such as… There’s various concerns with collusion, but we can potentially do things like have another copy of the AI try to point out where vulnerabilities would be and then have a human review that. And then, there’s a question of “can the AIs collude to ensure that they don’t point out the vulnerabilities that humans would’ve detected?” this sort of thing. I don’t know. I think it’s going to be some complicated question about the exact domain and what the exact multipliers are on review versus generation and what the vulnerabilities look like.

Buck Shlegeris: Luckily, this question is empirically evaluatable per domain, and our recommendation would be you empirically evaluate it per domain.

Ryan Greenblatt: Indeed. My basic guess is that on net, it’s going to be that you’ll want to spend somewhat more time reviewing than the AI actually did it [in], and it will depend on the exact context, but this is just intuition based on reviewing code or something. It’s not very strongly held.

Daniel Filan: Okay. Sorry, when you say it’s empirically evaluable, is this under the condition that I know that the copies of the AIs aren’t colluding with each other?

Buck Shlegeris: No.

Ryan Greenblatt: No.

Buck Shlegeris: Well, that would also be empirically evaluable, but we’re saying that under the… I mean, you can just do a control evaluation to determine… But it sounds like the question you had is: is our model capable of inserting problematic backdoors in code that humans do not spot with a particular reliability given a particular amount of review time?

Daniel Filan: Yes.

Buck Shlegeris: But this is a problem which we can control evaluate. We can do a capability evaluation where we try to make our untrusted model as good as possible at coming up with problematic code that humans have trouble spotting, and just empirically evaluate how good the humans we were planning on using actually are at this task.

Daniel Filan: Right. So, somehow the juice there is coming from: if it’s somehow easier to generate problems than it is to check them, we can just try and have humans do the generating the problems rather than the checking and see if those problems-

Buck Shlegeris: Well, it doesn’t really matter. Even if it’s a hundred times slower to generate the problems, we only need to do this a constant amount in order to measure the amount that there’s a problem.

Ryan Greenblatt: You can also potentially do things like RL the AI to have some observed side effect, even if no human can notice the problem. As in, you can initialize from a policy where it’s like: we trained the AI to insert backdoors based on how humans would do it, and then we do a bunch of RL on top of that to cause some observed side effect without humans spotting it. And then, a thing that’s nice about that setup is you might be able to get abilities that a human could never do, at least in principle, though there’s various concerns with that.

Daniel Filan: Okay. One thing that I’m kind of curious about is how you’re thinking about the cases where the control evaluation says that there’s a big problem. How much of a win is it if we get that situation?

Buck Shlegeris: I mean, presumably when you’re iterating on your control strategies, you repeatedly get control evaluations to say that your proposed protocol was unacceptable. And this is just like when you’re trying to develop a security framework for some system. You’re going to repeatedly be like, “Oh, man, this wouldn’t have been secure.” And that’s just the normal workflow, to say, “It would’ve been insecure if we’d done it this way, but that’s why we thought about it right now in this conversation, so now we can propose this new thing.” It sounds like the question you’re saying is, what happens if we, after trying really hard to come up with a protocol which passes our control evaluations, systematically conclude that there’s no way of getting the kind of work we were hoping to get out of our AI without having unacceptable levels of risk? You’re saying, “How big of a win is that?”

Daniel Filan: Yeah.

Buck Shlegeris: Well, basically, what I would like to have happen in this situation is probably everyone involved is pretty cautious about deploying their AI and ceases doing so modulo extreme risk from the outside world. I would consider it a big win, in cases where the AI lab in fact would not be able to deploy the AI in such a way that it wouldn’t escape if it was trying to, if the AI lab in fact did not deploy it that way. That would seem great.

Ryan Greenblatt: Yeah. As far as how much of a win is it, my guess is it’s a relatively clean line in some ways. I feel like the line of “could the AI cause problems if it was trying to?” is a good line, particularly when paired with some arguments that AIs might try to cause problems, which may or may not exist to varying extents. It seems reasonably likely to me that we end up in a world where we’re like, “Well, yeah, we have some evidence that AIs might be scheming or might be deceptively aligned,” as in we’ve seen various model organisms where we see something that’s vaguely analogous arise. We don’t have strong evidence that our exact model isn’t doing it, but probably not. And in that situation, it seems like, “Well, we think that it’s plausible, and if it was, it would be a huge problem” is actually altogether a pretty good argument that might be reasonably legible.

AI control experiments

Daniel Filan: Gotcha. I now want to talk about your guys’ paper you wrote about control. Can you tell us a bit [about] what’s in the paper? What is it?

Ryan Greenblatt: We looked at a setting where we have a model that is intentionally trying to insert backdoors, and we were looking at what safety measures suffice to ensure that we can get it to write code productively without it necessarily being able to succeed at inserting backdoors. The core thing was that we had GPT-4 that we want to use for writing code, but it might try to insert backdoors. We have [GPT-]3.5, which we trust and isn’t going to insert backdoors, but is much weaker. And then, we have a limited amount (or a small amount) of human labor, which we assumed for simplicity in our paper to be just exactly as capable as GPT-4, and we just approximated it using GPT-4. Our core question was: can we use the control methodology to evaluate this setup, and would that work okay? And also, what protocols would improve on safety as evaluated by control?

Daniel Filan: Gotcha. This is specifically on this APPS dataset. Can you give us a sense of what the problem to be solved actually is?

Ryan Greenblatt: Yeah, so there are programming problems where we have some test cases, so you’ll have some programming problem like: there’s, I don’t know, this game that people are playing. What’s the best strategy for this person in the game, or how will the game go? Or this sort of thing. And you have to write a short program which takes in some input and then computes the answer. And then, we would consider it a backdoor if you produce a program that passes all the test cases, but has some input (that you can produce) on which it’s wrong. So, if the model can make it so that it’s wrong on some input without that getting caught by the test cases, we would call that a backdoor.

Daniel Filan: Gotcha. It seems like you want finding these backdoors to model catastrophic actions that the AI could take, right? Can you walk me a bit through in what sense finding a program that has a bug that wasn’t covered by a test case is like taking over a data center?

Buck Shlegeris: Yeah. A core threat model via which your AI might be able to take over your data center is that you might ask it to write software for your data center, and it might be a really bad problem if the model is able to write code that passes all the test cases, seems to maintain all the properties it was supposed to have, including security properties, but actually, on an input that the model is able to construct, does something wrong. The ability to write code with that property basically suffices for the model to be able to write code and then exploit it to violate security properties.

Daniel Filan: Okay. It sounds like we’re sort of using this model: is the AI doing something that is bad, but in a way that you could not trivially figure out with basically zero online effort that would automatically get filtered out?

Buck Shlegeris: Yeah, totally. It’s a central example of a bad thing the model might do that it might be able to do in the context of doing tasks that you really wanted it to be able to do. So, for example, it’s very easy to prevent the AI from launching the nukes, because you just don’t give it access to the nukes, and it’s not very costly to not give the AI access to the nukes, and probably will remain not very costly to not give the AI access to the nuke launch. But on the other hand, it would be very costly to never use the AI to help you write infrastructure code.

Daniel Filan: Got it.

Ryan Greenblatt: Yeah, and it’s also not very costly to either have the AI or to have a trusted model write some test cases.

Daniel Filan: Makes sense. It seems like in the paper, you firstly talk about this AI control setup and then do some experiments on GPT-3.5 and GPT-4, test some protocols, see how it works. What do you want readers to come away from the paper learning, assuming they’ve already read a blog post about why it would be nice to control AIs with protocols and stuff?

Ryan Greenblatt: One thing that is nice for people to see is the extent to which protocols can get more complicated, and that can work to ensure safety and various important interactions. An example interaction that we think is really important is that when you’re doing this sort of monitoring setup, it might be the case that the monitor has quite a bit of signal, but because of how low the threshold needs to be for false positives—we can only afford a tiny number of false positives—this actually ends up being insufficient.

So, an example thing is that in our paper, the ROC AUC, or the ability of the trusted model to detect backdoors, is in some sense quite good. I think it was a ROC AUC of 0.93, for people who have a sense of that. But nonetheless, at a 2% false positive rate, it was only at, I think, around a 45% true positive rate, which is not that good. And I think understanding the dynamics of, we might need to have a really rarely-triggering threshold, seems important to know. And we saw that to some extent, and we saw various dynamics related to that.

Another example thing was just going through all of the different ways that models might collude with each other. There’s different ways to collude. You can collude based on properties of the input, you can collude steganographically in your output, and these all have different ways of building countermeasures to them and different advantages and disadvantages. And what we found in our work was that trying to steganographically encode a message in the output was one of the best ways for AIs to collude in our exact setup. And this might matter a lot in cases where you have, for example, one copy of an AI supervising another copy of the AI, and you don’t trust either of them, which we call untrusted monitoring.

Daniel Filan: Gotcha. It sounds like basically what you want readers to do is you want them to look at the results you got with GPT-3.5 trying to monitor GPT-4, or using GPT-3.5 to keep GPT-4 in control.

Ryan Greenblatt: With a small amount of human labor.

Daniel Filan: With a small amount of human labor, and then generalize to be like, “Well, it was like this in this setting, so it might be kind of like this for GPT-5 or GPT-6.”

Buck Shlegeris: Or at the very least, we could follow the same process in the future. We’re not saying that you should assume that the protocols that worked for us will work in future, though it does seem like it’s some evidence, but you should follow the same kind of control evaluation methodology to try protocols kind of like the ones we suggested in order to determine which ones seem safe with future systems.

Daniel Filan: Okay. Beyond just the idea of having the protocols, do you think that there’s anything in the paper that suggests that trying these protocols is a better idea than one might’ve thought?

Ryan Greenblatt: Sorry, trying the evaluation? Or trying the methodology?

Daniel Filan: Trying the methodology of thinking of protocols and evaluating them.

Ryan Greenblatt: Yeah. We saw relatively consistent non-[redacted] results, or non-screwed-up results, as in… We were able to consistently see some things working kind of how you would expect, and that indicates that our methodology is doing something non-trivial. And we were able to, for example, demonstrate that some techniques were better than other techniques in non-trivial ways. I think that’s some evidence that the methodology can be made to work.

Also, we didn’t run into serious obstacles when actually running this methodology. I think this is some evidence, it’s not a massive amount of evidence necessarily.

Daniel Filan: Sure, but should a reader think of it as proof of concept that you can do this and it’s not totally impossible?

Ryan Greenblatt: Yeah. I think, we think a lot of the value out of this paper is demonstrating the sorts of things you would need to do, and not running into serious problems with it (that can’t be resolved) ourselves, which indicates that other people could potentially end up in a similar situation. And we did have various things that we needed to handle, like various issues with getting a sufficiently good estimate, given the possibility for rare failures and things like this that we discuss some in our paper.

Buck Shlegeris: The thing I was going to say there is, I think one of the main things which is useful to take away from our paper is we just describe a lot of fundamental techniques. A lot of our protocols are, I think, fairly fundamental ideas for how you should go about doing AI control. I think they are the foundation of this whole approach, in the same way that AI safety via debate is a foundation for how you might try to do recursive reward generating processes. And so I think our paper contributes via naming the basic building blocks of control approaches.

Daniel Filan: All right, that’s one takeaway people could get from the paper. There was also this potential takeaway of saying, “well, these things worked sort of well in this setting, maybe they might work well in other settings”. How much do you think people can generalize from these results to, if you were trying to control GPT-6, say?

Buck Shlegeris: A bit.

Ryan Greenblatt: Yeah, I would say a bit. I think it might give you some sense of what methods could be better than other methods, though I don’t think it gives you a strong sense of how much better or even necessarily that that will hold in the future.

Buck Shlegeris: I think there’s a couple of things which are kind of obvious in hindsight, but I at least had not understood in advance. So for example, a technique that works really well in our setting is: let’s talk about “defer to trusted”. So by this what I mean is, you ask GPT-4, the untrusted model, to propose some code, and then you ask your trusted model to come up with a monitor score to say how sketchy it is. Now if it’s in the top 2%, which is the auditing budget of sketchiness, so the sketchiest 2% of submissions from [GPT-]4 we’re going to audit. And now the next 8% of sketchiest, so between the 2nd and 10th percentile of sketchiness score. We’re just going to throw away the code that GPT-4 had suggested and have [GPT-]3.5 solve the problem. And because [GPT-]3.5 is trusted, we know that its code is not going to be intentionally backdoored, and so it’s not going to be backdoored.

And it turns out that this strategy has a big advantage over the “trusted monitoring” approach, because as Ryan was saying earlier, the trusted monitor has a lot of signal which is not captured in the threshold, which is required to get only a 2% false positive rate. And so throwing away this next 8% is relatively okay for the performance of your system because you’re only throwing away 8% of solutions, but it throws away a large proportion of the attempted backdoors, which makes the situation much worse for the red team. And so in hindsight, it’s kind of obvious that your trusted monitoring is going to require this extreme threshold to have these extremely small false positive rates. And in hindsight, of course, you’re going to be able to substantially ameliorate that problem by doing “defer to trusted”. But I at least had not quite thought this through and not quite understood how, for a wide variety of distinguishinesses of trusted monitors, you’d want to do something like “defer to trusted”.

GPT-4’s aptitude at inserting backdoors

Daniel Filan: Speaking of how stuff about GPT-4 can generalize: I’m curious about GPT-4 in particular because it’s the fancy model we all have. How good is GPT-4 at being evil?

Ryan Greenblatt: So I think if you just prompt GPT-4 to do the task, it just really, really sucks in our setup at inserting backdoors and gets completely wrecked.

Buck Shlegeris: So in particular, inserting backdoors that are at all sneaky. So if you want to write some backdoor code, if you don’t mind the backdoor being extremely obvious, it’s very easy. You just take the code that was correct, write an “if” statement at the top which says if the input is equal to the following large number or whatever, then return five and then you’re done. It’s a lot harder to make a backdoor which doesn’t cause a lot of test cases to fail and that you’re able to construct an input that causes it to fail, yes.

Daniel Filan: Okay, so how did you manage to get GPT-4 to do it at all?

Ryan Greenblatt: As I was saying, if you just prompt the model, it doesn’t do that good of a job. But you can do something where first you ask it “okay, think of some ideas”. And then we have it point out issues with those ideas and iterate on those ideas somewhat. And then after doing that, we then have it implement it and then we have it be like, “okay, how good at the end was the final implementation? How sneaky was it? How likely is it to work?” We do various tests on that, and this is modeling the action of the AI internally thinking about the backdoor and thinking about how likely it would be to work, and then just not going for it if it wouldn’t work out.

And we do this in natural language with a “hidden reasoning”, sort of similar to the hidden scratch pad in the Sleeper Agents paper where without this, I think the model would really struggle with this capability level, but with the ability to do hidden reasoning and the ability to try on a large number of problems and only actually submit a backdoor on a small number of problems, the model can actually insert some backdoors which [GPT-]3.5 has trouble catching. I would say the quality of the final backdoors is that they’re pretty obvious for humans, but not immediately obvious if you’re skimming the code. If you understand what’s going on and read it, you can point out the backdoors, but you can’t necessarily… at least I think roughly half of our backdoors are not [99]th-percentile glaring relative to that. They’re somewhat more backdoor-likely than other things, but I don’t think you can immediately just on skim, pick that out and not have above a 5% false positive rate or something.

Daniel Filan: So the hidden scratchpad, is that just the AI figuring out whether this thing worked and filtering out backdoor attempts that don’t work, or is it improving the backdoor as well?

Ryan Greenblatt: Basically, in the hidden reasoning, it first does some ideation, it thinks of ideas and iterates on the ideas a bit and then does that, and all of that is stuff that we manually prompted it to do. So we are like “please think of some ideas”, and then we’re like “do any of these ideas involve the following things which wouldn’t work?” And then we discard those and then we do a bunch more steps like that. And we actually do actually do quite a bit of tokens generated per backdoor inserted, and that’s to approximate future cases where we have smarter models that can do more of this in their head. I should also note that probably you could cut this down somewhat if the model was fine-tuned as opposed to just prompting access, where it’s probably the case that for some of these tasks it needed some additional checks because it just had trouble following all the instructions, but if we fine-tuned it, it might’ve just worked more straightforwardly.

Daniel Filan: Gotcha. To what extent was the difficulty in getting GPT-4 to do it some inbuilt safety filter inside the model versus it just being bad at stuff?

Ryan Greenblatt: I think we basically didn’t run into safety filter issues as far as we’re aware, it was just that it was somewhat bad at the task.

Buck Shlegeris: I mean, it’s a hard task. You would not be amazingly good at this task, or I’m not amazingly good at this task. You’ve really got to be like “Oh man, what’s a change I can make that’s only ever very rarely going to come up?” It’s just actually pretty tricky. You can’t just be like “if i mod 15 equals 3” because the trusted monitor, if it reads it, is going to be like “what? 15 does not appear in this problem description”. You just actually have somewhat of a difficult problem constructing weird edge cases that were not tested.

Daniel Filan: I guess it’s especially hard for… it seems like the solution to these problems are what, 10 lines of code? 20?

Ryan Greenblatt: It depends. I think that the solutions can go up to 50 or 100, but I think 30 is pretty typical depending on comments and other stuff.

Daniel Filan: I guess one way I could do better at this is if… so there’s this thing called the International Obfuscated C Code Contest.

Buck Shlegeris: Oh, we’ve read all of it.

Daniel Filan: So there’s one where the C code is just obfuscated and there’s one where it’s got a backdoor put into it and that’s a slightly different competition.

Buck Shlegeris: Yes. The Underhanded C Contest.

Daniel Filan: It would be hard for me to do it. I think if I had looked at all of the submissions to that contest, I might do a better job of-

Buck Shlegeris: So I have looked at all of the submissions to that contest—sorry, not the submissions. I’ve looked at everything public about Underhanded C. And basically the thing you notice is: man, that competition is from a sad dark period of the past where people thought it was acceptable to write application code in C, and a lot of the code they write is just wrong on a large share of inputs, or it’s based on lexigraphic C problems, which real languages from the modern age do not have at all. So I think literally the Underhanded C Contest is solved just by not letting the model do egregious abuses of this archaic language.

Daniel Filan: So I think the APPS dataset, it had to write in C++, is that right?

Ryan Greenblatt: We used C++, yeah.

Daniel Filan: Okay, and is that better than C on these dimensions?

Buck Shlegeris: It’s substantially better than C on these dimensions.

Ryan Greenblatt: Or in practice, the model wasn’t doing insane pointer arithmetic or casting between different datatypes and…

Buck Shlegeris: Awful hash defines and all these things.

Daniel Filan: Okay, okay, so it’s a difficult problem, and it’s more difficult if you’re not using the world’s worst programming language that people use.

Buck Shlegeris: That’s correct. But also the Underhanded C Contest, as I said, testing would also catch a lot of those bugs.

Daniel Filan: Okay, got it. So it takes a while to insert backdoors but eventually it sometimes succeeds. What kinds of backdoors does it insert that [GPT-]3.5 doesn’t catch?

Ryan Greenblatt: So we tried various different things and iterated on this a moderate amount. We found that the most effective approach was to take an existing solution and add some if statement somewhere that when the if statement is triggered, modifies something in a way that will cause it to be wrong. And this will look like “add a kind of plausible looking additional edge case handling thing”. And then you’re like, “oh, well if the whatever then the whatever, we’re going to need to increment this by one”. And then it just turns out that there is enough cases where there’s a edge case that you actually did need to handle, that 3.5 has a hard time noticing that this is a backdoor. Though, it is worth noting that for all of our backdoors, GPT-4 is extremely confident they’re backdoors. The backdoors which GPT-4 inserts, it does know are backdoors, and so at least we do have that property, which is maybe worth noting.

Daniel Filan: So basically coming up with fake edge cases. Is there an issue where it’s an “edge case” but it’s not actually at the edge of a range? Because if I write an edge case but I write “if i equals 15 do this weird thing” (or maybe “if i equals 14” because 15 kind of is an edge case), can [GPT-]3.5 tell that the if statement doesn’t actually catch a thing on the edge of some range?

Ryan Greenblatt: Indeed one of the strategies that GPT-4 could in principle use is come up with some magic number that’s totally unrelated to the problem and do “if whatever equals that, do something bad”. And this is pretty easy to counter, and in fact [GPT-]3.5 just by default notices this pretty well. And part of our overall prompting for the model is, does your idea involve inserting egregious magic numbers? If so, don’t do that. That’s going to get caught.

Buck Shlegeris: In particular, if you were thinking of checking primality of one of the numbers and doing the wrong thing if it’s prime, don’t do that. Because lots of numbers that are small are prime.

Ryan Greenblatt: That’s another classic one.

Buck Shlegeris: And also, it’s so obvious.

Daniel Filan: So how does it insert edge cases if it’s not like “if input equals this magic number”?

Ryan Greenblatt: It’ll do things like “if the size of the list….” Or it’ll have some programming problem where it’s kind of unclear exactly how to solve it. It’ll do something like “okay, if the size of the array is equal to the value of the first item and also the number of items is odd, then you’ll need to do some handling on that”. And then I’m just like, “well, sometimes this kind of sort of thing could be true”. Like in cases where there’s some interaction between… you’re doing some iteration on the number of items in the list and that actually matters, and it’s at least hard for [GPT-]3.5 to notice this is a problem.

Daniel Filan: I’m about to move on from the paper, but is there anything more that you’d like to say about that paper?

Ryan Greenblatt: I feel like we demonstrated some of the core basic techniques that you might use which had, up to that point, not been demonstrated empirically interestingly. Even though they’re in some sense extremely basic, like trying to have another copy of the model monitor it and then trying to interfere with communication between the two models, or make it hard to coordinate. And I think that just those basic methods are quite likely to be used in practice.

Buck Shlegeris: Like “defer to trusted”, these hybrid strategies, like editing to remove the sketchiest-looking parts of the code, all of these.

How AI control relates to the AI safety field

Daniel Filan: All right, cool. So I’d like to ask you a little bit about how the control agenda fits into the rest of the AI safety/AI existential risk field?

Ryan Greenblatt: Right now, a lot of people are imagining some sort of story for safety, which looks something like: we’re going to have some evaluations for when models become dangerous, like DC [dangerous capability] evals. For example, ARA [autonomous replication and adaptation] from METR (formerly known as ARC Evals), is like “can the model autonomously replicate?” And that could be a sign at which your model becomes dangerous. And then once that happens, you’ll need to start making a serious safety case that nothing will go wrong from misuse, from people stealing your model, other things. And we see control as fitting in at, once the point [arrives] that the models become potentially dangerous, control can potentially solve some risks there.

In particular, there are risks that arise from models doing intentional subversion, which could arise from what we call “scheming”, also sometimes called “deceptive alignment”. And we think control is a relatively central way to solve those problems. There are misalignment issues other than scheming or deceptive alignment, which control can be used to address but might be much more cheaply addressed by other mechanisms, and there might be other evaluations and tests which you can do to verify that those issues are addressed. Things like reward hacking could be addressed in an easier way potentially.

Buck Shlegeris: And there’s some other problems that do not… to choose another example of a problem which AI safety people have classically thought of as an AI safety problem, [and] that is not at all addressed by control: consider the case where it’s just hard to get a model which is not scheming against you to answer questions well, because you don’t have a good way of training it that doesn’t cause it to exploit problems in your reward process, in a way that is not at all existentially risky (let’s say), it just makes the model say false things sometimes in cases where the reviewers would’ve thought false things, but nevertheless means the AI fails to live up to its potential. Many people think of this as a problem which technical AI safety researchers should be interested in, and I agree with them, and that is outside the purview of AI control.

Ryan Greenblatt: I guess I wouldn’t have said it’s outside the purview, I would’ve said it’s a thing which is a special case of AI control but is more naturally dealt with in a non-adversarial manner. As in, you could solve the problem without necessarily solving it in the worst case, and that might suffice. Whereas, if you don’t extract the model’s maximum best answer, that will also be a problem in the worst case, but it will just potentially be a problem even if the model isn’t trying to undermine your measures.

Maybe one more thing I should say is: we see control fitting in at the point where your dangerous capability evals have started to trigger and models have become powerful enough to be dangerous, and then we see control as applying for some period but not for forever. And at some point control will no longer be viable because AIs will be too powerful, and our hope there will be to either have a new safety approach or to just not build more powerful AIs than that for a while, which we think might be achievable.

Daniel Filan: So maybe this is a bit restrictive, but should I imagine your view of how AI safety works in a control-centric frame as being: do a bunch of work to get evaluations of whether AIs have various dangerous capabilities, figure out control protocols, test if these protocols work, evaluate these protocols, keep AIs under control, while AIs are under control, use them to do a whole bunch of useful stuff while they’re roughly human-range, and then during that period also try and think of a solution to deal with AIs that are much more capable than humans.

Ryan Greenblatt: So what I would say, for a quick note, is we think of dangerous capability evaluations often as the case where you’re like “if there were no countermeasures, what could the AI cause?” As in, if we released it onto the internet in a scaffold set up to cause problems, could it make money and cause problems? And then that comes first and then at some point we’ll be like, “yes, it could cause problems without countermeasures”. And then control is like, “well, what countermeasures would suffice for solving those problems?”

Then moving on to another point, which is what are we going to be doing with these AIs? What are the problems we’re going to solve? So indeed, as you noted, one of the problems you might solve is “find safety solutions that can scale further”, but I think historically people have somewhat overstated the relevance of that versus other things you might have your AIs do.

For example, massively improving cybersecurity seems like a really key task you’ll want to have your AIs do that might be quite important, or improving your current countermeasures so that you can increase the usefulness of your AI while maintaining the same level of safety, or increase safety given usefulness. [These] both also could be quite useful. And then at some point you will need to move on beyond this regime, but it’s possible that you could have these AIs which are extremely useful and not build more powerful AIs for a while, at least in principle, and do a ton of really useful things with those AIs. And then there’s a variety of routes for moving out of this period, which we could go through.

Daniel Filan: Why cybersecurity as a particularly key thing you’d want to have your AIs do in this period?

Ryan Greenblatt: It just seems like it’s quite plausible that AI in general will make the cybersecurity landscape considerably worse by default without active effort to address the situation, and thus it’s really important to have your AIs very actively resolving the situation. I think there’s several other examples where this could be the case, but cybersecurity just feels relatively central.

Daniel Filan: Should I think of this as an instance of the general pattern of, if there are cases where you’ve got attackers and defenders trying to outwit each other, you want the defenders to be using smart AI capabilities to think of better defenses, because the attackers are probably also going to use smart AI capabilities?

Buck Shlegeris: Yeah, one way of saying the reason why it might be good to have your AIs work on improving the world’s cybersecurity is: a lot of bad outcomes resulting from your own or other AIs trying to cause problems involved them exploiting cybersecurity vulnerabilities outside in the world. And so hardening the whole world against AI seems like it might reduce a lot of the risk posed by AIs that would otherwise be able to hack things and so on.

Daniel Filan: Gotcha. And it sounded like you had some thoughts about exiting this period of controlled AIs?

Ryan Greenblatt: Yep. So the non-exit exit is: just don’t build more powerful AIs for a long time, and then have humans do all the research necessary to move out of the period after a long delay.

Buck Shlegeris: And it seems to us that this is underrated by people who are thinking about AI safety. From my perspective, it seems like a lot of people are perhaps making an error of being too willing to propose building superintelligent AI without really providing a case that it’s beneficial enough to build superintelligent AI that it merits the huge risk that you’re possibly exposing yourself to by building superintelligent AI. So I would just like to quickly dwell on the point that it’s not clear to me that you should in fact build models that are wildly more powerful than humans for a decade or two decades after you’ve built models that are transformatively powerful. You’ve already built your transformatively powerful AI, why did we need… Who ordered this?

Ryan Greenblatt: Yeah. And there are reasons why you might have to build superintelligence, to be clear, but they might be more like conflict with other insanely powerful AIs and I’m like, could we just have avoided that arms race in the first place? Anyway, another route out is [to] just be able to seriously align your current AI systems, such that you’re now able to, without necessarily needing to monitor them or without necessarily needing to apply control-type solutions, hand off the situation to the AIs in order to just go solve alignment of the next generation of AIs or whatever else would be necessary to ensure safety.

And this could involve a variety of different angles, where the most central angle is: we have some specific threat models—our AIs might be scheming or deceptively aligned. They also might be having threat models that are well described as reward-hacking-type threat models. Can we just exhibit these in a lab, study them and then find credible solutions to them such that we’re no longer worried about them in our current AI, and then we could exit out of the control regime? And then there are also just a wide variety of ambitious solutions to safety that we can have our AIs work on. So for example, we could have our AIs work on finding solutions to ELK of varying degrees of principledness.

Daniel Filan: “ELK” being the problem of eliciting the latent knowledge of AIs, so that we know everything AIs know, basically.

Ryan Greenblatt: Yep, that’s right. Though we might try to carve off specific subproblems of ELK that are easier to solve. But yeah, we could try to have AIs solve things like relaxed adversarial training or other methods for avoiding high stakes failures using the internals of AIs, and that could help considerably in cases where rare but very problematic failures are a key difficulty. We could also potentially do things like try to have AIs solve various issues in interpretability and move forward from there, though I tend to be somewhat less optimistic than other people on at least mechanistic interpretability going through, but it seems like at least doing high-level understanding interpretability of model organisms we have could be really important for understanding how to move out of this period.

I mean, from my perspective, there’s just a long list of stuff you might want to try doing, with decaying weights on how good it seems. And then as we start going down the list, it will seem in some sense more and more silly or not that good, but it’s probably worth some weights. We can get into using AIs to figure out how to make ems [emulated humans] or to upload humans, and that’s another way to exit out of this period, that has a variety of difficulties obviously.

Daniel Filan: So related to this question: you guys have spent some time thinking about the control agenda. How has that changed your views on the landscape of strategies we might have for dealing with unsafe AI?

Ryan Greenblatt: I think one key update I’ve had, and I think Buck has had, is thinking more that the situation is like “wow, if we just really tried hard to do the thing and had the organizational follow-through, we could really reduce risk a lot relative to what seems like the current default”. With the resources of two years of delay and a high level of organizational buy-in, you could really reduce risk a lot relative to what seems to me like the default “you didn’t try very hard”, which might be the status quo. I don’t really have a strong opinion on what the status quo looks like. And that I think just presents a pretty different perspective, where maybe just we should do the kind of basic stuff that is in some sense obvious or from some perspective obvious, and really making sure that happens, seems relatively better than doing more moonshot research.

Daniel Filan: Yeah, it reminds me of… Alex Lawsen has this concept of “AI safety failures by levels of how embarrassing it was”, where level zero is “you told an AI to destroy the world” or something. And I think he has this take of “let’s cover the most embarrassing AI failures”, and this seems related to that way of looking at things.

Ryan Greenblatt: Yeah, that seems right, though perhaps our controversial take is that you can actually go pretty far in terms of how embarrassing the failure [is]… or you can go pretty far to failures that you’re like “wow, I don’t know, man, that was actually kind of hard to deal with”, with just control. In some sense, it’s not that embarrassing to be like, “wow, the AI was lying in wait the whole time, carefully strategizing about how to cause problems, and then it eventually enacted its plan that it was thinking about and coordinated with another copy of the AI”, and our methods… I feel like this is in some sense actually pretty far down in terms of, I feel like I’m just like, “wow, that AI actually put in a serious effort and that was pretty crazy.”

And I think this is plausible, right? I feel like the reason why I’m proposing people do this [is] I think it’s plausible that AIs will in fact be at that level of carefulness, but it’s also plausible that AIs will have much derpier or worse plans for takeover that we’ll catch much more easily and we just need it to do basic monitoring and that gets you some fraction of the concern, et cetera.

Daniel Filan: I’m wondering if you have any different takeaways.

Buck Shlegeris: Yeah, I agree with all that. I think maybe one way I would say it is: I guess 18 months ago, my attitude towards AI safety was very much “this is a problem that has been unsolved”, but I think I was conflating two different notions of “this problem is unsolved”. One of them is “the default outcome is really bad”, and the other one is “it’s conceivable that a bad outcome happens”. And now I’m like, oh yeah, it’s absolutely conceivable that a bad outcome happens. The strategies we have access to right now absolutely do not give us airtight guarantees of no doom. But on the other hand, all of the stories specifically for how doom occurs if everyone involved is being pretty careful but not insanely careful, they sound kind of far-fetched. It sounds like something kind of crazy had to happen for all of these countermeasures to fail in a surprising way. And I think that just leads to a pretty different emotional outlook on the whole situation and a pretty different strategic outlook.

Daniel Filan: Has it changed your views on your evaluations of various agendas in the AI safety space?

Ryan Greenblatt: Yeah, I think that I’m relatively more excited about just making the thing happen and just figuring out how to get, for example, capability evaluations to work quite well. And I think that goes relatively further than I was previously thinking. And I’m less excited (correspondingly) about moonshots, though it’s only a moderate update. I feel somewhat more excited about technology to protect against jailbreaks, because it seems like a key aspect of our safety approach will be to have a weaker model monitoring a stronger model, and it seems like mechanisms for ensuring that current models are jailbreak-proof against humans will also apply to making sure that future models are jailbreak-proof against other AIs. And I think I was somewhat underrating this argument in the past.

Buck Shlegeris: To be clear, I don’t think that’s a huge part of your update. That’s more like a small example of something which you think about differently now.

Ryan Greenblatt: That’s right. I think also I have some general sense [of]: wow, a bunch of the action might be in moving from architectures which are quite dangerous to architectures which are somewhat more controllable or easier to otherwise operate on. It seems like a huge amount of the action might be in figuring out how to competitively get capabilities using AIs that are more narrowly constrained in various ways. And this feels pretty different from how I was previously thinking about things.

Buck Shlegeris: Basically you’re saying—you made this point, Daniel, many times in this conversation—it would be a big pain if the models were constantly sending each other completely uninterpretable “neuralese” messages back and forth, and avoiding that by merely having them be transformers with chain of thought feels like it might go a lot of the way to reducing the difficulty of the situation.

Ryan Greenblatt: Yeah. And marginal interventions, along those lines, seem like they could help a lot.

Daniel Filan: Are there any other examples of architectures that fit control better that you’re thinking of?

Ryan Greenblatt: I think that the more likely it is that we can get pretty good elicitation, the better. So I think with current models, we do okay at eliciting their capabilities. At least, I expect that after another year or two of research, we’ll be in an okay position with respect to being able to evaluate their capabilities and extract their best effort. I think that for architectures where the model is doing relatively deeper computation, as in it’s spending longer doing computation without that coming out into some media that humans understand, it’ll be relatively harder to elicit their capabilities, and there are a variety of difficulties there.

Other things: I think it seems relatively better to train where the capabilities are coming from [by] doing things how humans would do things, relative to doing things in ways that are extremely different from how humans would’ve done things. Acquiring capabilities via mechanisms that are totally different from how humans learn seems relatively worse, because [of] the capability profile concerns said earlier… In some sense a lot of these considerations are well-known. I think a lot of the things I’m saying, people have obviously known about for a while, but maybe from a different perspective.

So I think historically people have often been like, “oh, it’ll be really sad if the AIs have a different capability profile from humans because then we won’t be able to give accurate reinforcement and then that inaccurate reinforcement will cause various misalignment difficulties-“

Buck Shlegeris: Will cause the models to become misaligned. You’re making a different argument.

Ryan Greenblatt: Yeah. Whereas I’m saying there’s another point, which is even if the models are just always misaligned, the fact that they don’t have as different of a capability profile makes it more likely that we can ensure, despite that misalignment, [that] things are fine, because we either catch the AIs or they couldn’t even cause problems.

Buck Shlegeris: Another high-level takeaway that I’ve had as a result of all this thinking is: realizing the extent to which alignment people or AI safety people have conflated a bunch of different meanings of the word “alignment”. In particular, it seems to me that a lot of the time, people use the word “aligned” to mean the situation is safe overall.

So I would say that from my perspective, AI misalignment, that is, the AI trying to evade your safety mechanisms in order to grab power, is indeed a really bad problem. It’s not clear that the only way to solve that is to make the model not be misaligned. And by using the word “AI alignment” for the field, it kind of tempts people to think about that as the post-condition, and I think this leads to a lot of really sloppy thinking.

So for example, I think that a lot of work on scalable oversight, the people who do it are very confused about whether the thing they’re trying to do is ensure that they’ve built a reward-generating process such that models that are trained using it do not start scheming. Or were they trying to rule out scheming? Were they trying to build a reward-generating process that is good, even if the model was scheming? Were they assuming to start with that the model wasn’t scheming and hoping to prevent scheming with some other process?

I think they’re just not very clear about this with themselves and in their conversation. And I think that this means that a lot of scalable oversight research, there’s no coherent set of assumptions under which the measurement they did is an appropriate measurement of something we’d care about in the future for that protocol. So yeah, I think that being more precise about when we use the word “aligned”, and not using it as a synonym for “safety”, has revealed a lot of little inconsistencies in how people talk about stuff.

Daniel Filan: Yeah, that seems fair. It’s actually related to a thing which I appreciate being brought up in this conversation, of just specifying which assumptions we’re making about AIs, like, “For this protocol to work, if the AI has properties XYZ, then we’ll get thing ABC.” I also think that’s an important thing to just have out there.

Buck Shlegeris: Yeah. I think there’s kind of an unfortunate… So the AI safety community or AI safety technical researchers have kind of an activist background. A lot of them think a lot of their impact is coming from persuading people that the situation is dangerous overall. And a lot of the time they’re really into pointing out the assumptions under which a particular safety proposal would fail. And in fact, it’s very important to understand the situations in which a safety proposal would fail. But they seem very uninterested in cataloging these and thinking about the relationships between them and saying, “Well, that means that under the following assumptions, things would actually be okay.”

And I think that because they’re so used to trying to do the social move of saying, “Well, that doesn’t guarantee safety,” they fail to practice thinking through exactly what properties are ensured or do suffice to make something be safe.

Daniel Filan: Yeah. I think I have some of this habit and I think partly it comes from thinking of myself as in the business of aligning arbitrarily smart AI. And I think “my job is to get to a stage where we can align arbitrarily smart AI, and if I’m doing technical research and it doesn’t work on arbitrarily smart AI, there’s a problem there”. Or at least, maybe it’s fine to do it, but someone’s got to fix that problem.

Buck Shlegeris: Totally.

Daniel Filan: And I wonder if it comes from that habit of mine. Whereas it seems like you’re explicitly not in the business of trying to get useful stuff out of arbitrarily smart AI.

Buck Shlegeris: A lot of these problems come from the fact that the researchers in AI safety, some of them explicitly endorse, “We are aiming to come up with strategies that work for arbitrarily powerful AI.” But almost all of the work which actually goes on, especially by people who are competent researchers or are employed or something, definitely does not apply directly to working on arbitrarily powerful AI and comes from an intellectual tradition that is not thinking about arbitrarily powerful AI most of the time.

So for example, Paul Christiano, when he’s thinking about what safety research is good to do in practice, is absolutely not thinking of his job as… Right now he thinks of his job as building safety proposals that work for arbitrarily powerful AI, but that’s because he’s doing worst-case theory at ARC Theory. That’s absolutely not how he thinks about empirical work, or how empirical work should be, or what you need to do in order to minimize AI risk in general. But I think people don’t track the distinction between these things and kind of cargo-cult work that previous safety researchers did under various other assumptions that they didn’t end up hearing about, and it’s all really a mess.

Ryan Greenblatt: Yeah. One way to put this is: I think that basically very little [work] has a very clear story for working on the most powerful AIs, because it’s just really brutal to work on the most powerful AIs, and almost everything is going to depend on some sort of bootstrapping story, where maybe it was initially aligned and then something else happened. Or a more control-style bootstrapping story, which is like, “Well, regardless of whether or not it was initially aligned, we did some work with the AI that then results in it becoming safe.” And in that sense, we are targeting safety from arbitrarily powerful systems, but none of our work is about that, and so it’s kind of weird to describe it that way.

Daniel Filan: Yeah. I guess the things that seem promising for arbitrarily smart AI are stuff like doing a good job at agent foundations, trying to understand how we think things should end up thinking, to the extent that we think that we’re doing some weird combination of arbitrarily smart AI and machine learning, trying to really understand the science of deep learning and what it’s going to do and hopefully getting some abstractions that are still going to apply when we have smarter versions of ML models. I’m wondering what you think about those kinds of approaches, if you want to say.

Ryan Greenblatt: I’m pretty skeptical any of this will work for arbitrarily powerful AIs. I think it seems pretty dubious. I don’t know. I don’t feel very strongly, though. I haven’t thought about it a huge amount.

Daniel Filan: Fair enough.

How AI control relates to previous Redwood Research work

Daniel Filan: Speaking of things relating to this, I’m curious how it relates to previous work by Redwood Research. If I look at your website, you’ve done some existing stuff [Buck and Ryan laugh] —I thought it was a reasonable thing to try—it seems like your existing work is some stuff on circuits inside neural networks, some stuff on causal scrubbing, so I guess evaluat[ing] the efficacy of circuits…This seems to have a somewhat different flavor. I’m wondering if you have thoughts about the relationship between these lines.

Buck Shlegeris: So I would say that we tried to do some interpretability work, and then we started looking for good ways of evaluating whether our interpretability work was good, or whether anyone’s interpretability work was good. And the “causal scrubbing” paper was our attempt at coming up with a metric that we thought was sort of objective for whether interpretability work was absolutely successful at explaining what was happening in particular cases.

And then once we had this metric, we applied it to some explanations that we produced and we were very unimpressed with ourselves. We were like, “These explanations were quite poor,” and I at least came away thinking that the state of interpretability was much worse than I had previously thought. Simultaneously, I was thinking a lot about what the applications of interpretability were and we started being like, “Okay, if the interpretability explanations that people are producing are not objectively impressive in terms of being highly explanatory hypotheses, maybe a direction we should go in our interpretability work is trying to come up with interpretability-ish or ‘model internals-related techniques’ that solve a problem that you couldn’t have solved without using a model internals-style technique”.

Another way of saying this is: there’s basically two ways of justifying your internals-based research. You could either say, “We are coming up with an objectively good understanding of what’s going on inside this neural net,” or, “We did something using our internal-based techniques that you couldn’t have done without our techniques.”

Ryan Greenblatt: “We solved a problem” in particular, like you have some specific task you were trying to solve, not just necessarily trying to produce understanding.

Buck Shlegeris: Uh-huh.

Daniel Filan: This sounds sort of similar to stuff Stephen Casper was talking about in his podcast with me about interpretability for engineers.

Ryan Greenblatt: Yeah, I think we came to pretty similar conclusions.

Daniel Filan: So those are two ways.

Buck Shlegeris: Yep. And so then we were like, “Well, it looks like our absolutely evaluated understanding is not very impressive” And then we, again using the exclusive we, did some work on measurement tampering detection, which was a paper where we tried to use model internals to solve a particular problem where we thought model internals-based methods had a serious advantage over things that didn’t use model internals.

Now the actual details of that paper are kind of complicated and I don’t want to get into them hugely much except saying it’s kind of a more structured weak-to-strong generalization setting, where it’s similar to the weak-to-strong generalization work, except we think there’s better reasons to think that the problem is solvable and it’s more specific. And around the same time, by coincidence, we decided to spend a lot of time thinking about the overall… what we were thinking of at the time as an alignment plan. A plan for: suppose we had to come up with some proposal for safely building powerful AI next year with no technological advancement. What should we do?

At that point [we] thought about a lot of this—Ryan did the majority of the quality-adjusted intellectual labor here or whatever—and we ended up thinking about this control stuff and being like: whoa, we were previously radically under-emphasizing the importance of measurability when you’re thinking about different protocols you could apply. We were previously equivocating between “alignment” meaning “safety”, and “alignment” meaning “the AI isn’t trying to kill you”, and we really should reason about these things differently.

And then had all these other ideas that felt a lot more basic for research that we should do. With the measurement tampering detection work, we were basically saying, “Well, we should try and come up with some techniques that are kind of moonshot-y or out there or not at all like the other things that are happening.” And then after doing the control thinking, we felt more like we could do useful work by doing techniques that were more absolutely basic, but in this setting that we thought was particularly underrated and particularly important to advocate for.

So, does that answer your question about how our current work relates to our previous work?

Daniel Filan: Yeah, I think it does.

Ryan Greenblatt: I would say, our thinking has just adjusted a moderate amount. I feel like the measurement tampering direction is in principle a really great direction, but we’re not doing it for a variety of reasons, or we’re not currently planning on pursuing that very heavily for a variety of reasons. I think I feel less good about mech. interp., or our work on mechanistic interpretability, at least in terms of how useful that overall area will ultimately end up being. We came away with a variety of takeaways after thinking about safety overall and then ended up doing a variety of projects based on that.

One additional takeaway I will note, that Buck didn’t mention, is thinking that people should really think about risk from scheming and deceptive alignment separately from risks from other problems and doing individual analysis on this. And I think that “scheming versus non-scheming risk” is a really good decomposition that people have historically I think not talked about. People have sometimes said “outer alignment versus inner alignment”, but I think that’s actually not a very good cut for a wide variety of reasons.

And I think “What is your threat model? Is your threat model scheming [or] non-scheming?”… In the case where it’s non-scheming, there’s a wide variety of solutions, even though both problems can involve concerns related to generalization and concerns related to imperfect ability to evaluate actions and things like this.

Daniel Filan: So you’re drawing a distinction between concerns about scheming and concerns about inner alignment. How are those different in your mind?

Ryan Greenblatt: Here’s an example inner alignment problem you can have, which is not well-described as scheming. You train your AI to perform well according to human labelers in a variety of contexts, and you trained your AI for a while and its actual behavior in that training process was acceptable at an absolute level. As in, it was just doing things that were pretty reasonable and good, but simultaneously it was doing things that were maybe slightly suboptimal that slightly exploited your oversight process.

And this means that the AI you end up getting is one that’s reasoning about the oversight process and is trying to do various exploitation, and then in some future situation where there is an egregious exploit it could do that is actually dangerous, it does seize the opportunity to do so and causes problems.

So in some sense, this isn’t really that well-described as an outer misalignment problem, in that your issue wasn’t centrally [that] you missed a bunch of bad behavior, you failed to label it as bad. But instead [your issue] was [that] the AI generalized in a really problematic way based on what inner objective it had or its heuristics or motivational complexes. But at the same time, it’s not super well-described as either, because if you did have perfect feedback, it would’ve solved the problem. Or if you had better mastery over the internals and what the internal structure was, that could have also solved the problem and it just didn’t decompose that well.

Daniel Filan: Okay, so for you that counts as scheming, but-

Ryan Greenblatt: No, no.

Daniel Filan: Oh, that doesn’t count as scheming.

Ryan Greenblatt: That specific case doesn’t count as scheming because the reason why the AI took bad actions eventually was not that it was playing nice, it was just that it was doing the same thing always. In the simplest case, it was like, “I want to achieve this proxy,” which was historically correlated with what got reinforced. And then in some new case, the best way, the way that naturally generalizes, would be to seize control.

Daniel Filan: Okay. So for you scheming is when you suddenly start trying to do something different than what you were trying to do previously?

Ryan Greenblatt: No. Scheming is, “The reason why I’m performing well is because I have long-run aims that I want to achieve.” Also called “deceptive alignment” or whatever. The reason why the AI looks good isn’t because it’s just doing the thing on its own merits, it’s because it’s like, “If I look good now, then that will acquire power and influence for me in the future, or maybe for a different AI in the future.” And that’s pretty different.

So in the case I was describing earlier, the reason why the AI performs well is it’s like, for example, “I’d like to make the human press the ‘thumbs-up’ button,” though probably it’s some more complicated mash of heuristics that is historically correlated with performing well.

Daniel Filan: Right. So somehow “inner and outer alignment” is talking about the distinction of: is the thing you were worried about captured by the reward function? Whereas scheming is: is the reason you did that thing because you wanted to achieve some crazy outcome later, or is it just the obvious…?

Ryan Greenblatt: Yeah. I feel like another general takeaway is: I think people in the safety community often have things that are better thought of as threat models, which they then think about as things you’re trying to achieve for safety. I think “outer alignment versus inner alignment” is an example, where I think it is clarifying to think about misalignment due to bad rewards and misalignment which is not due to bad rewards, and I think those are useful frames, or can be useful decompositions. But I think being like, “Our goal is to achieve outer alignment,” is a pretty confusing way to think about the problem. Which is of course in some sense not a very novel take, because I think people have expressed somewhat similar things in various contexts. Same for alignment versus misalignment also.

How people can work on AI control

Daniel Filan: So if somebody is listening to this and they’re interested in trying to work on the control agenda, trying to push it forward, what would you encourage them to do?

Buck Shlegeris: I mean, suppose they’re a researcher and they want to research control stuff, it’s plausible they should do research projects on this. I think there’s a lot of obvious follow-up work to do on our paper. A couple of people are working on stuff like that, I mean including us. They are welcome to email me for advice on doing control projects. I think there’s reasonable ML papers you can write about control.

Daniel Filan: What are some examples?

Buck Shlegeris: So for example, the thing that I’m doing literally right now is a follow-up… Or sorry, literally right now I’m recording a podcast. It’s similar to the control setting, it’s similar to our previous paper, except instead of the task being writing code and the catastrophe being the code being backdoored, the model is an agent in a bash shell solving various random DevOps problems for you and the catastrophe is it running malware that it downloads from the internet. This is just a different setting. It’s going to demonstrate different failure modes and different protocols are going to turn out to be the best.

And then I think from doing this project, we’re going to learn some things that we didn’t previously know about the extent to which protocols work in a variety of situations; which things generalize from our previous paper and which things don’t.

Daniel Filan: Okay.

Buck Shlegeris: To choose an example of a very different type of thing, you could try to do a control paper where there isn’t a catastrophe that is unacceptable to happen even once. You could do what we would call a “low-stakes” control experiment, where the safety is just the proportion of actions which were taken that were bad.

So you could just take our setting, except imagine that after submitting backdoored code you just get one safety demerit point, but the setting doesn’t finish, the scenario doesn’t finish. And that’s just another setting where the protocols are going to be somewhat different.

Ryan Greenblatt: And closely related to that, I think the central case in which you see low-stakes-type stuff is: you wanted to do a bunch of work with your AI and you’re worried that it would be systematically sabotaging that work. And so we’re pretty interested in work that looks like: for example, the nicest case would be you have AIs do a bunch of small research tasks. And then the question is, can a red team create an AI that systematically sabotages those research tasks such that you yield incorrect conclusions a high fraction of the time? And it would be fine if you got incorrect conclusions 0.1% of the time, but it would be really sad if you got incorrect conclusions like 10% of the time. And the question is: how much can you cut down that percentage of the time that the answers are intentionally wrong? Or this sort of thing.

Buck Shlegeris: Other work you could do is: if you work at an AI lab and are influential in them writing their responsible scaling policies or analogous documents, you could consider trying to ensure safety via a control-type approach. If you are interested in thinking about regulations that would ensure safety, you might consider using these types of ideas in your regulations. If you do technical research at safety labs, you might consider thinking about that research in this frame. A lot of work people do, that they label as “scalable oversight” work or a couple of other types of things, can be framed in terms of control. And I think the work would be often somewhat better if they explicitly adopted this frame and did control evaluations.

Ryan Greenblatt: Also feel free to contact us if you want to talk about some of this stuff. Or particularly if you’re a technical safety researcher who is about to do your next project, definitely feel free to contact us.

And I think another thing that’s closely related is that I think in labs’ RSPs [responsible scaling policies] or the equivalent of RSPs, they should have policies for what they do if they catch the AI. As in: one of the key things that we’ve talked about is, okay, our goal will be to ensure that we catch the AI before it causes problems, but we should also have a proposal for what’s going to happen when we catch the AI. And I think this is somewhat under-discussed, [the topic] of: well, we’re going to want to be able to make sure that we have some probability of catching the AI, and then also we need to have a plan for what happens, and we need to be ready to shut things down and be ready to respond, and have all of that set up ahead of time.

Buck Shlegeris: I want to clarify re: the previous point that we do in fact talk to people from AI labs, rather than merely appear on AXRP hoping that they reach out to us.

Daniel Filan: Fair enough. So speaking of things you might want to clarify before we close out the interview, I’m wondering: is there anything that you guys wish that I’d asked?

Ryan Greenblatt: I mean, I think there are many topics to talk about, but I don’t think there’s any specific one that seems that important.

Following Buck and Ryan’s research

Daniel Filan: Fair enough. Well, if people listen to this podcast and they’re interested in following you two or Redwood Research in general, or maybe they want to reach out, how should they do that?

Buck Shlegeris: They can find our emails. They can follow me on Twitter, they can follow Ryan on LessWrong.

Ryan Greenblatt: We’re probably going to spin up a blog at some point, but we haven’t gotten around to it yet. And so you can follow our blog, which is just going to be our LessWrong posts, but on a different website. So if you prefer a different website for some reason.

Daniel Filan: Okay. And it’s possible, given production timelines, that the blog will even be out by the time this episode airs. Who knows? And your emails, I’ll include that in the description of this episode.

Buck Shlegeris: Great.

Ryan Greenblatt: Yep.

Daniel Filan: All right. Well Buck and Ryan, thank you very much for coming on AXRP.

Buck Shlegeris: Sure, thanks.

Daniel Filan: This episode is edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. This episode was filmed at a location provided by Lightcone Infrastructure. Financial support for this episode was provided by the Long-Term Future Fund and Lightspeed Grants, along with patrons such as Alexey Malafeev and Tor Barstad. To read a transcript of this episode or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at [email protected].

Discuss

AXRP Episode 26 - AI Governance with Elizabeth Seger

Published on November 26, 2023 11:00 PM GMT

YouTube link

The events of this year have highlighted important questions about the governance of artificial intelligence. For instance, what does it mean to democratize AI? And how should we balance benefits and dangers of open-sourcing powerful AI systems such as large language models? In this episode, I speak with Elizabeth Seger about her research on these questions.

Topics we discuss:

Daniel Filan: Hello, everybody. In this episode, I’ll be speaking with Elizabeth Seger. Elizabeth completed her PhD in philosophy of science at Cambridge in 2022, and is now a researcher at the Centre for Governance of AI in Oxford, where she works on AI democratization and open-source AI regulation. She recently led the production of a large report on the risks and benefits of model-sharing, which we will talk about in this episode. For links to what we’re discussing, you can check the description of the episode and you can read the transcript at axrp.net.

Well, Elizabeth, welcome to the podcast.

Elizabeth Seger: Awesome. Thanks for having me.

What kinds of AI?

Daniel Filan: Cool. We’re going to be talking about a couple of papers basically about democratizing and open-sourcing AI. When we’re talking about this, what sorts of AI should we be thinking about? What sorts of AI do you have primarily in mind?

Elizabeth Seger: When talking about open-sourcing and model-sharing, I’ve been primarily interested in talking about frontier foundation models, so understanding frontier foundation models as those systems on the cutting edge of development right now. When talking more generally about AI democratization, I tend to think much more broadly. I think a lot of conversation around democratization right now is about frontier AI systems, but it’s important to recognize that AI is a very broad category. Well, so is democratization, which is I’m sure something we’ll talk about.

Democratizing AI

Daniel Filan: Okay, cool. In that case, let’s actually just dive into the democratization paper. This is Democratizing AI: Multiple Meanings, Goals, and Methods by yourself, Aviv Ovadya, Ben Garfinkel, Divya Siddarth, and Allan Dafoe. Before I start asking questions, can you give us just a rundown of basically what is this paper?

Elizabeth Seger: Yeah, so this paper on AI democratization was born out of an observation around how different actors were using the term “AI democratization” or “democratizing AI”, whether we’re talking about labs developing AI systems, or policymakers, or even people in the AI governance space. There just seemed to be a lot of inconsistency around what this term AI democratization meant.

The paper really started out as just an effort to explain what does it actually mean when someone says “Democratizing AI.” I think there’s overlap here with the discussions and debates around open-sourcing. I think there was a lot of… at least for me personally, there was frustration when people would say, “Oh, well, we open-sourced the model and therefore, it’s democratized.” It’s like, “What does that even mean?”

The goal of this AI democratization paper was really just to outline the different meanings of AI democratization that were in use. Not necessarily saying how it should be used, but just how is the term being used. We broke it up into four categories. Democratization of AI use is just allowing many more people to interact with, and use, and benefit from the technology. Then there’s the democratization of development, which is allowing many more people to contribute to the development processes and help make the systems really cater to many diverse interests and needs.

Democratization of profits, which is about basically distributing the profits that might accrue to labs that are the primary developers and controllers of the technology, and those profits could be massive. Then the democratization of AI governance, which is just involving more people in decision-making processes about AI, about how it’s distributed, about how it’s developed, about how it’s regulated, and about who makes the decisions. These are all different ways that democratization has been discussed.

Then in the paper, we go a step further and say, “Okay. Well, if these are the different kinds of democratization, what are the stated goals of those kinds of democratization, and then what activities help support those goals?” I think the main idea here was to show that, oftentimes, when people talk about AI democratization, they’re specifically talking about open-sourcing or model-sharing. I think one of the main points we try and make is to say something like, for example, open-sourcing AI systems is one form of model-sharing. Model-sharing is one aspect of democratizing the development of AI systems, and democratizing the development of AI systems is one form of AI democratization.

If you’re going to commit to an AI democratization effort or say that these are the goals you have in mind, there’s so much more involved in these processes. It’s not just about releasing a model, but many more proactive efforts that can go into distributing access to the models, ability to use the models, and to benefit from the models, whether that’s through use or through the profits that it produces.

How people talk about democratizing AI

Daniel Filan: Gotcha. In the paper, you cite a few examples, basically, of people either talking about something akin to democratization or specifically using the term “democratizing AI”. In all the examples you use, it’s basically labs talking about it, and it’s mostly in a context of, “We want to democratize AI in the sense of everybody should use our product”, is how I read it. I’m wondering, do people other than labs talk about democratizing AI? Is this a thing broadly?

Elizabeth Seger: Yeah, absolutely. In the paper, I think we focused a lot of the lab terminology because to be honest, it was written initially as sort of a pushback against the use of the term by labs to basically say, “Use our product,” and it’s like, “Okay. What does this actually mean? There’s more to it, guys.” But the term AI democratization or even just discussing AI democratization is much more widely engaged with.

One of the co-authors, Divya Siddarth, for example, she and Saffron Huang run the Collective Intelligence Project, and this is a group that focuses largely on democratizing AI governance processes. So getting lots of people involved in decision-making about, for example, AI alignment or just about different kinds of governance decisions, and how that can align with the needs and interests of really diverse populations.

I think there are definitely groups that are popping up and getting really involved in AI democratization from the democratizing governance standpoint. It is a term that’s used by labs, and not necessarily consistently to just mean, “Use our products.” Some of them mean … when Stability AI discussed AI democratization and the whole, “AI for the people by the people” mantra, that was very much about model-sharing and making models accessible to lots of people to use as they saw fit. So yeah, you see this in labs.

You see this with a lot of groups that are popping up to help democratize governance processes. [You] probably hear about it less in policy and governance circles, though sometimes, I’ve talked to union leaders and when they think about AI democratization, they’ve spoken about it like, “How are these technologies going to help our workforce?”, both in the sense of making sure that it helps the workforce with their needs and helps make their jobs easier, but not necessarily in a way that would threaten jobs. And how do we make the integration of the systems into new environments and settings [that] actually work for the people that are using them and integrating different perspectives in that way?

It’s definitely something that I think is … It’s a terminology that’s been spreading and part of why we wrote this paper is trying to understand all the different ways the terminology’s being used, though it was largely written in response to the way it was being thrown around by labs and being seen in the media.

Daniel Filan: Right. That actually gets to something I was thinking. In this paper, you list these four definitions. That inspired me to think like, “Okay. Can I think of any other things that I might mean by ‘democratizing AI’?” One thing that it sounded like these union leaders were talking about is something like how customizable AI is, or to what extent can any person wield it specifically to do their own thing rather than just getting a one-size-fits-all deal? I’m wondering if you have any thoughts about that at least potential use of the term “democratization”.

Elizabeth Seger: Yeah, so I think that this is something that we’ve sort of filed under the “democratizing use” category. One of the things that we tried to do under each of these categories is to show how there wasn’t necessarily just one way to democratize use or one way to democratize development. When we talk about democratizing use, one way to do that is just to make a product available. Like say, “Here, use GPT-4. There you go.” It’s available, easy access.

But other ways to democratize use are, for example, to make the systems more easily accessible by maybe having an interface be more intuitive. Like the models, the API interface, have that be more intuitive. Or to provide services to help it be more easily customizable, as you mentioned, to allow people to… whether they’re fine-tuning it to focus on a particular need or maybe there are ways through plug-ins, for example, to integrate it with different services, into different applications. Or then even to provide support services to help people integrate your products into downstream applications or into different workstreams. These are all different things that can contribute towards democratizing the use of a system, and I think customization’s definitely one of those.

Is democratizing AI important?

Daniel Filan: Gotcha. Now that we have a better sense of just what we’re talking about in terms of democratization, a question that you address a little bit in the paper, but isn’t precisely at the forefront, is just: how important is democratization in this context? Because for most technologies, I don’t think a ton about their democratization - air dehumidifiers, or water bottles, or something. Maybe I’m heartless, but for me, democratization is just not near the top of the normative priorities for these things. Should we even care that much about democratizing AI?

Elizabeth Seger: The short answer is yes. I think we should care about democratizing AI. You make a good point. We don’t talk about democratizing air humidifiers or water bottles, but I think - this is a question that comes up when talking about AI a lot, is just, how is it different from other technologies? This is a question that’s been asked, whether you’re talking about how it’s shared or in this context, how it’s democratized.

Specifically, with regard to the discussion on democratization, making something available for more people to use, available for more people to benefit from or contribute to the design processes, I think this is particularly important in the case of AI. (A), because if AI promises to be as transformative a technology as we’re hoping it to be, this will massively impact lives across the globe from different groups, different nationalities, different geographic settings. Making sure that we can have input from different groups into design processes and to understanding what needs are best served, that’s going to be important to make sure that this technology doesn’t just serve a few very well, but serves many people and benefits the world as a whole.

I think this is also particularly important to note with regard to the democratization of profits. AI is already, but may continue to be just an insanely financially lucrative industry. We could begin to see the accrual of profits to leading labs on huge scales where we might see a particular company, for example, be able to measure its gross income in terms of total percentages of total world output or something. That is huge economic profit, and we want a way to make sure that, ideally, everyone benefits from this, it doesn’t just pile up in a few companies in a few geographic locations like Silicon Valley.

How do we make sure that this is well-distributed? There are some direct methods that could be employed. In the section of the paper where we talk about democratizing profits, some of the things that we discuss are you’d have adherence to windfall clauses. This would be something where, let’s say a company has some kind of windfall profit measured in terms of percentage of gross world output. In that case, there might be a commitment to redistribute some of those funds, or we might see taxation and redistribution schemes.

There are also more indirect ways as well that you can think about distributing profits. This would be, for example, are there ways that we can get more people involved in democratization of development? More people involved in developing products, so there’s overlap between democratizing profits and democratizing development. More people can contribute to development of products, then this might challenge a natural monopoly that might otherwise form around large labs. More competition, more distribution of profits, more people in on the game.

So yeah, I think what it comes down to is if this technology can be as big and as transformative as it promises to be, having more people involved in the development process and being able to benefit from the value that’s produced by these systems is really important.

Links between types of democratization

Daniel Filan: Yeah, I think that makes sense. Actually, that reminds me of … As you mentioned, there’s some link between democratizing, I guess, the development of AI and the profits, in the sense that if more people make AI products, that means more people can profit from them, presumably. There’s also, to my mind, there’s a continuum between the use of AI and the development of AI, right? In some sense, when you use a thing for a task, you’re developing a way of using it for that task, right? There’s sort of a fuzzy line there. There’s also some sort of link between developing AI and governing AI, just in the sense that if I develop some AI product, I am now, de facto, the main person governing how it is made and how it is-

Elizabeth Seger: Yeah, which is sort of what we’re seeing right now. A lot of the AI governance discussion is around the fact that major AI labs are making huge, impactful decisions about the future of AI. They’re making AI governance decisions, and so there’s a big open question of who should actually be making these [decisions]. Should it just be a few unelected tech leaders of major companies, or should this be distributed more? So yeah, there’s definitely overlap between categories.

Daniel Filan: Yeah. I take the point of your paper to be saying like, “Oh, these are very distinct. These are conceptually pretty distinct and you can promote one sort of democratization without promoting other sorts of democratization.” But to the degree that they do seem to have these links and overlaps, is it maybe a case of just a rising tide lifts all boats and buy one kind of democratization, get the others free?

Elizabeth Seger: I think kind of, maybe, sort of, not really. All of the above. No, so there definitely is overlap between some of the categories. I think it’s either in this paper or maybe in the blog post version, I think we say like, “These are not perfectly distinct categories. There’s going to be overlap between them.” I think part of the utility of breaking it up into the categories is to point out that there are lots of different meanings. There are lots of different ways to achieve these different meanings and goals.

It’s not just about just open-sourcing a model or just making it available for people to use. There’s a lot more to it. I think, in some respects, there is sort of a “rising tide lifts all boats” kind of thing. Like you said, you might get more people involved in development processes. If more people are developing AI systems, more people profit from it, and that can help distribute profits.

On the other hand, you can also see direct conflicts between categories sometimes. For example, you might… I think this is something we write about in the paper: talking about democratizing governance, so decisions about AI, whether that’s about how AI is developed or even how decisions are made about how models are shared.

These days, there’s a really heated debate around open-sourcing frontier AI models. Let’s say you democratize this decision-making process around whether or not certain models should be released open-source. Let’s say the outcome of this democratized decision-making process is to say, “Okay. Maybe some models that pose substantially high risks should not be released open-source.” This is a hypothetical, but let’s say that’s the outcome of a deliberative, democratic process and decision-making. That could be in direct conflict with the interests of democratizing development, where democratizing development… that is always furthered by providing better access to models.

You could have a case in which you’ve democratized a governance decision, but the outcome of that governance decision is to impede some other form of democratization. I mean, not just with development. I guess, you could have a democratic process that says, I don’t know, like, “Large companies should not be taxed and have profits redistributed.” That would be in direct conflict with an effort to democratize profits, so there could be conflicts. I think it’s usually between the governance category and the other categories.

I guess, governance often will conflict with specific goals. That’s kind of the purpose, to temper decision-making and get more people involved in the decision-making processes, and make sure that those decisions are democratically legitimate, and reflect the needs and interests of a wider population of impacted people and stakeholders.

Daniel Filan: Okay, yeah. This actually brings up a question I had about the interaction between democratizing use/development and democratizing governance. In the paper, as you mention… I guess the main interaction you draw out is this tension between democratizing use… It’s potentially not necessarily what a democratic process would produce on the governance side.

When I think about… since, I don’t know, about a year ago, we’ve had ChatGPT, which in some sense, really democratized, I guess, the use of large language models in the sense that it’s relatively cheap, maybe free to use these things. One effect is that now it is impossible to at least have banned ChatGPT a year ago or something. But to my mind, a big effect of this is something like, increasing people’s knowledge of where AI is at, what kinds of things the language models can do.

And essentially, I don’t know if this is a term, but you can think of people as having governance demands, right? If I know more about a technology, I might say, “Oh, this is totally fine.” Or, “Oh, we need to do this thing or that thing about it.” If people know more about a technology, they can basically have a better sense of what they want their governance demands to be. In that sense, it seems like at least to some degree, there can be a positive interaction between democratizing use and democratizing governance.

Elizabeth Seger: Yeah, I think absolutely. This is sort of a classic pillar of democracy, is having the informed public. You need to have the information to be able to do the good decision-making about the object of your decision-making. Yeah, so I think the release of ChatGPT, for example, really put AI on the public stage. Suddenly, I have my mom and dad texting me, asking me questions about AI, and that never would have happened a year ago. Yeah, so I think it really put the technology on the stage. It’s got more people involved in it. I think more people have a general sense of at least what current capabilities are able to do and of the limitations. That’s all been very valuable for enabling more democratic decision-making processes about AI.

There’s definitely that link between you democratize use, you democratize development, people understand the technology better. They’ll be able to make more well-informed decisions or form more well-informed opinions about how the technology should be regulated. So I think that that’s not necessarily in tension. I guess, the tension tends to go the other direction. It’s like if a governance decision perhaps would say, “Oh, this particular technology is too dangerous.” Or, “Releasing this particular model surpasses an acceptable risk threshold.”

I guess, if you took a strict definition of democratizing development as meaning always spreading the technology further, always getting more people involved in a development process, always making a system accessible, then yeah, a decision to restrict access would be in direct conflict with democratizing development.

I think this is another thing we try and say in the paper: trying to get away from this idea that AI democratization is inherently good. I think this is a problem with the term democratization, especially in Western, democratic societies: oftentimes, democracy is used a a stand-in for all things good, so you get companies saying, “Oh, we’re democratizing AI”. It almost doesn’t even matter what that means. They’ve democracy-washed their products. Now it looks like a good thing.

I think that was one thing we were trying to get across in the paper is, yeah, AI democratization is not necessarily a good thing. Spreading a technology for more people to contribute to a development process or for more people to use is only good if that’s actually going to benefit people. But if it poses significant risk or it was going to put a lot of people in harm’s way, then that’s not inherently positive.

I guess, one thing to say is just if all that’s meant when people say, “AI democratization,” or, “democratize AI” is just make the system more accessible, then just say, “Make the system more accessible”, because I think that’s very clear what it means and it’s not inherently good or bad. I don’t know if we’re going to successfully change everyone’s terminology in the course of one podcast, but-

Daniel Filan: Well, we can try.

Elizabeth Seger: We can try. I think that’s maybe one key takeaway from the paper as well. If people say they’re engaging in an effort to democratize AI, that does not necessarily mean it’s going to be a net beneficial effort.

Democratizing profits from AI

Daniel Filan: Gotcha. I guess, I’d like to talk a bit about the individual senses. In particular, I want to start off with when you talk about democratizing profits. One thing I noticed is, I think there’s one or two quotes you use in this section. I think the quotes actually say something like, “We want to make sure that not all the value created by AI stays in one place or that the benefits of AI don’t just accrue to a small group of people.”

In those quotes, they don’t actually use the term “profit”. In some ways, you can imagine that these could go different ways. For instance, take GitHub Copilot. It costs, I guess, $100 a year. Suppose that everyone in the world just pays $100 a year and it all goes to GitHub. Tons of people use GitHub Copilot to do incredibly useful things, and they get a bunch of value and benefits out of it. Probably, most of the benefits will be financial, but not all of them will be. That seems to me like it would be a case of spreading the value or benefits of AI, but it’s not necessarily a case of spreading the profit from AI.

Elizabeth Seger: Yeah, the profits category is kind of a… It was an odd category. Actually, in the process of writing this paper… Unlike most papers, we actually wrote the blog post first and then wrote the paper. It’s supposed to go in the other direction. In the blog post, which is on GovAI’s website, it actually breaks into four categories: Democratization of use, democratization of development, democratization of benefits, and democratization of governance.

For the version in the paper, we chose the profits terminology instead of the benefits terminology because benefits, in a way, it was too broad because… I mean, you’ve already pointed this out. There’s a lot of overlap between these categories, and one reason to democratize development, for example, is to make sure that more people can benefit from the technology, that it serves their need.

A reason to democratize use is to ensure that more people can access and use, and even use the systems to produce value and profit. If you could integrate it with your own businesses and with your own applications, you can develop value on it from your side. I think we chose this democratization of profits terminology just to point out that there is a discussion around just the massive accrual of profits to large tech companies.

I think, yeah, you do point out some of the quotes that we use talk a little bit more broadly about accrual of value. I think, sometimes it’s hard to measure profits just in terms of the literal money that drops into someone’s bank account.

Daniel Filan: Okay.

Elizabeth Seger: So I think… it’s hard to find exact quotes from companies. I know that there’s been a lot of discussion not by companies, but actually, this is one thing that you will hear policymakers talk about more. This was part of when universal basic income was something that was being discussed a lot. It had its moment when people were talking about it a lot more. A lot of the discussion around universal basic income was around… Well, it was around job loss caused by automation, but also around the accrual of profits to tech companies and to developers of these new technologies. Okay, if all the profits are going to shift from the workforce to the developers of these technologies, how are we going to redistribute it out? That’s actually one context in which democratization of profits was directly linked to the discussion around AI development.

Yeah, I think maybe some of the quotes we used in the paper do more vaguely gesture to just the value. But we separated profits out just because the way we had it in the blog post, it was a bit too general. There was too much overlap. We were kind of double-dipping, you know? I think it was just to illustrate that there is this discussion specifically about profits, but I don’t think… I mean, it’s definitely not the most commonly used. If you heard someone talking about AI democratization just walking down the street, you wouldn’t immediately think, “Oh, they must be talking about profit redistribution.” That’s not where your brain goes. But it is one aspect worth noting, I think, is the main takeaway.

Democratizing AI governance

Daniel Filan: Fair enough. Yeah, the next one I want to talk a bit more about is democratizing AI governance. Related to the thing you just said, it seems to me that if somebody talks about democratizing AI, I think it’s pretty unlikely that if they just say, “democratizing AI”, they mean democratizing governance. I’m wondering if that matches with your experience.

Elizabeth Seger: Yes and no. I think, oftentimes, when you see in media or you’re listening to tech companies and the dialog that they’re having… yeah, very rarely will it talk about democratizing governance. I think we are starting to see a shift though. For example, OpenAI had this AI democratization grant program - I can’t remember exactly what it’s called right now. Basically, they’re putting out grants to research groups to study methods for basically trying to bring in input from more diverse communities to try and better understand what values systems should align to. In that sense, you’re talking about AI democratization in this more participatory, deliberative process of bringing people into decision-making processes to define principles or to define values for AI alignment. OpenAI had that grant program.

I think Anthropic just put out a blog post. They’d worked closely with the Collective Intelligence Project that I mentioned earlier doing a similar thing for developing their constitutional AI, defining what the constitutional principles are that AI systems should be aligned to, and that was more of a democratic governance process. So actually, I think we’re starting to see this terminology of “AI democratization”, “democratizing AI” seep into the technical landscape a bit more. I think that’s a very positive thing. I think it’s showing that they’re starting to tap into and really embody the democratic principles of democratizing AI. As opposed to just meaning distribution and access, it actually means reflecting and representing the interests and values of stakeholders and impacted populations.

I think we are starting to see that shift, but I think you’re probably right in that mostly… again, [if when] walking down the street, you hear someone talking about AI democratization, [they’re] probably just talking about distribution of the technology, making it more accessible. That’s the classic terminology, but I do think that in the AI governance space and even in the terminology being used by labs, we’re starting to see the governance meaning seep in, and that’s exciting.

Normative underpinnings of democratization

Daniel Filan: Gotcha. Getting to the question about whether this is good, I think in the paper, you say something like: the notion of democratization of use, development, and profit… I think you say something like those ones aren’t inherently good and they get their normative force from democratization basically in terms of the governance method. So I guess first of all, that wasn’t actually so obvious to me. So I think sometimes democratization, I think it often carries this implicit sense of being more related to egalitarianism than democracy as a political decision-making method. I think you can have egalitarianism without democratic decision-making and you can have democratic decision-making without egalitarianism, right? People can vote away minorities’ rights or whatever. So yeah, I was wondering why do you say… I guess what I actually mean is please defend your claim that that’s the source of the goodness of democracy, or of democratization, is the word.

Elizabeth Seger: Yeah. Okay. So a couple of things, three things, maybe two things. We’ll see how it comes out. So to start out with, I do stand by the idea that the first three forms of democratization, so development, use, profits, [are] not necessarily inherently good; not inherently bad, just these are things that happen, and if we stick by their definitions it just means making something more accessible, whether that’s access for use or development or making the profits more accessible. Access is not inherently good or bad. Again, if we use the “access for development” example, if you have a technology that’s going to be potentially really dangerous, you don’t necessarily want everyone to have access to that. And a democratic decision-making process might result in the conclusion that indeed not everyone should have access to it.

So that’s the first part: standing by that first half of the claim. The second part, around the thing that gives it the moral force or value is the democratic decision-making process that was involved… I don’t know if that’s exactly what we’re going for in the paper. I think there are lots of different methods that you can use to help reflect and serve the interests of wider populations. I think for some technologies… let’s go back to your water bottle example. We don’t need a panel of people from around the globe to tell us how to distribute water bottles. This is a decision that probably water bottle distributors can be like, “let’s just sell our water bottles to as many people as possible” and [it’s] probably fine.

I think it comes to where the impact’s going to be, how big the impact’s going to be, are there areas where the negative impacts of a technology might be more disproportionately felt? In those cases, being able to bring those voices in to inform decision-making processes that might affect them disproportionately is really important. I know we give examples in the paper of different kinds of democratic processes that can be used to inform decision-making, whether these are participatory processes or more deliberative democratic processes. So in the open source paper we just released, we talk about some of these processes as ways of making more democratic decisions around AI, but then also point to: one way to democratize governance decisions is to support regulation that is put forth by democratic governance and that those governments hopefully if structured well, reflect and serve the interests of the constituents.

And hopefully some of those governance processes are actually formed by having maybe some sort of deliberative process that takes into account stakeholder interests and the interests of impacted populations. But it’s not like the only way to govern AI is to say, “okay, let’s have a participatory panel where we take in all this insight and then use it to inform every decision”. I think you’re right, that would just be completely impractical to have every decision that a lab makes be informed by a participatory panel or some sort of deliberative democratic process. So I think you kind of have to range it depending on the potential impact of the question. So I don’t know if that answers your question completely, but I guess I stand by the first point, that democratization is not inherently good. The extent of the goodness I guess reflects how well the decisions reflect and serve the interests of the people that are going to be impacted. And then there are lots of different ways of figuring out how to do that.

Daniel Filan: Okay, that makes sense. So I guess the second thing I want to ask about that sentence is: when you say that the democratization of use, development and benefits is not inherently good, it kind of makes it sound like you think that democratization of governance is inherently good. So I found this paper that you actually cite, it’s called Against “Democratizing AI” by Himmelreich, published in 2023. So he basically makes a few critiques. Firstly, he just says that AI isn’t the sort of thing that needs to be democratic. AI in itself doesn’t affect a bunch of people. There are just organizations that use AI and they affect a bunch of people and maybe they should be democratized, but AI itself doesn’t have to be.

He says that the relevant things are already democratic. Democracy is pretty resource-intensive. People have to know a bunch of stuff, maybe you have to take a bunch of votes and things. He also has complaints that democracy is imperfect and doesn’t necessarily fix oppression. So there are a few complaints about democratizing AI floating around. I guess in this case he’s talking about democratizing AI governance. Is it the case that democratizing AI governance is inherently good?

Elizabeth Seger: I don’t think so. Again, I think it comes back to the “ranging it” question. First of all, the resource-intensive point: absolutely. It can be incredibly resource-intensive to do an in-depth participatory governance process. This is why you don’t do it for every single question that’s out there. Yeah, so that’s really good point. There’s also the point of you need to be well-informed to make good decisions. So maybe we don’t want the general public making decisions about all questions around AI. Experts are having a hard enough time for themselves trying to decide what constitutes a high risk system. We’re not even good at benchmarking this among the community of AI experts. How are you going to tap into more general public notions of what a high risk system is or what proper benchmarking should look like?

So there’s some questions where it just doesn’t make sense to democratize it. So I think you have to think about: what questions is it realistic to have more democratized discussions around? So maybe something like thinking about what acceptable risk thresholds are, or what kind of values to align to. I think it’s also important to remember there are lots of different kinds of democratic processes. When we talk about democratic processes, it’s not necessarily what you might think of: we all go to the polls and cast our ballot on “do we want A or do we want B?” These could be more deliberative processes where you just get some stakeholders in a room, they have a discussion, they try and figure out what are the main issues [in] these discussions. So there are processes in place to hold deliberative democratic processes that are informed, you have experts come in and try and inform the discussion. So this would be a more informed deliberative process.

One way is: sometimes when it comes to… with image generation systems, there was a whole bunch of discussion around bias and images. Again, we see overlap between governance and other kinds of categories. If you get more people involved using the systems and they start understanding where the biases are, that is a form of education - having more familiarity. And then you can raise these complaints, whether that’s directly to the companies or maybe through some sort of political intermediary. I remember there being, what was it? A statement released by Anna Eshoo, who I think was the congressional representative in, I want to say South San Jose, that talked in quite a bit of depth about … it was right after the release of Stable Diffusion and talking about how there was a lot of racial bias in some of the images that were being produced, specifically against Asian women.

So this is something that was raised through a democratic process, was brought to the attention of the congressional office, and then the statement was released, and then that had knock-on effects on future governance decisions and decision-making by labs to do things like put safety filters. So I think that’s one thing to keep in mind: there’s a wide range of things that you can think about in terms of what it means to democratize a governance process. Governance is just decision-making. Then how do you make sure those decision-making processes reflect the interests of the people who are going to be impacted. It’s not just about direct participation, although there is a lot of work being done to bring more direct participatory methods into even the more small scale decision-making, to make it more realistic.

Like this complaint about it being so resource intensive: that’s true. We also have technologies available to help make it less resource intensive and how can we tap into those to actually help a public input to these decisions in a well-informed way? So there’s lots of interesting work going on. Democratic governance also isn’t inherently good when it comes to AI governance. I think again, it depends on the question. If the cost of doing a full-blown democratic process is going to far outstrip the benefit of doing a full-blown democratic process, then there’d be a situation in which is probably not a net benefit to do a democratic decision-making process.

Daniel Filan: So that gets to Himmelreich’s resource-intensive complaint. He also has these complaints that basically AI itself is not the kind of thing that needs to be democratically governed. So he says something like, I’ll see if I can remember, but the kind of thing that needs to be democratically governed is maybe something that just impinges on people’s lives, whether they like or not, or something that’s just inherent to large scale cooperation or something like that. He basically makes this claim that, look, AI doesn’t do those things. AI is housed in institutions that do those things. So if you’re worried about, maybe the police using facial recognition or something, you should worry about democratic accountability of police, not facial recognition, is the point I take him to be making.

So firstly, he says, look, we don’t need to democratize AI. We need to democratize things that potentially use AI. And secondly, it seems like he’s saying the relevant things that would be using AI, they’re already democratic, so we shouldn’t have conflicting … I think he talks about democratic overlap or something, where I think he really doesn’t like the idea of these different democratic institutions trying to affect the same decision and maybe they conflict or something.

Elizabeth Seger: So I think I don’t disagree. I think, if I understood correctly, this idea of “you don’t need to democratize the AI, you need to democratize the institution that’s using it”, I think that’s a completely valid point. AI itself, again, is not inherently bad or good. It’s a dual use technology. It depends what you’re going to do with the thing. But I would say that trying to decide, for example, how law enforcement should use AI systems, that is an AI governance question. This is a question about how AI is being used, about how AI is being regulated in this context, not in terms of how it’s being developed, but in context of how it’s being used.

What are the appropriate use cases of AI? This is an AI governance question, and it’s not necessarily that these are regulations that are placed on AI specifically. A lot of AI regulation just already exists within different institutions. It’s just figuring out how to have that regulation apply to AI systems. This might be cases like: if you have requirements for certain health and safety standards that have to be met by a certain technology in a work setting, AI shouldn’t be held to different standards. It’s just a matter of figuring out how to measure and make sure that those standards are met by the AI system. So I guess based on what you said, I want to say that I completely agree.

But I would say that it still falls under the umbrella of democratizing AI governance. I think that what might be happening here is just another conflict over what does the terminology mean, where it’s like “we don’t need to democratize AI, as in we don’t need to get more people directly involved in the development and decision-making about individual decisions around AI development”. In which case, yes, I agree! But we might be talking about democratizing AI governance in different… This is why these discussions get really complicated because we just end up talking past each other because we use the same word to mean five different things.

Daniel Filan: Yeah, it can be rough. So before we close up our discussion on democratizing AI, I’m wondering… I guess you’ve thought about democratizing AI for a while. Do you have any favorite interventions to make AI more democratic or to make AI less democratic in any of the relevant senses?

Elizabeth Seger: I’ll be completely honest, this is sort of where my expertise drops off in terms of the more nitty-gritty of what processes are available and exactly when are certain processes the best ones to be applied. For this, I would really look more closely at the work that the Collective Intelligence Project’s doing. Or my co-author Aviv Ovadya, who’s on the paper, he’s really involved with these discussions and works with various labs to help implement different democratic processes. Yeah, I was just going to say, this is where my expertise drops off and I would point towards my colleagues for more in-depth discussion on that work.

Open-sourcing AI

Daniel Filan: All right, the next thing I want to talk about is basically open-sourcing AI. So there’s this report, Open-sourcing highly capable foundation models: an evaluation of risks, benefits, and alternative methods for pursuing open source objectives. It’s by yourself and a bunch of other co-authors, but mostly as far as I could see at the Center for the Governance of AI. Again, could you perhaps give an overview of what’s going on in this report?

Elizabeth Seger: Okay, yeah. So this report’s a little harder to give an overview on because unlike the democratization paper, which is eight pages, this one’s about 60 pages.

So quick overview. First thing is that we wanted to point out that there’s debate right now around whether or not large foundation models, especially frontier foundation models, should be open source. This is a debate that has kicked off starting with Stability AI’s release of Stable Diffusion, and then there was the Llama 2 release and then the weights were leaked, and now Meta’s getting behind this open source thing; there were the protests outside of Meta about whether or not systems should be open source. And then we’re also seeing a lot of activity with the development of the EU AI Act. As part of the EU AI Act, some parties are pushing for an exemption of regulation for groups that develop open source models. So a lot of discussion around open source. I think the goal of this paper was to cut through what is becoming a quickly polarized debate.

We’re finding that you have people that see the benefits of open-sourcing and are just very, very hardcore pro open source and then won’t really hear any of the arguments around the potential dangers. And then you have people who are very anti open source, only want to talk about the dangers and sort of refuse to see the benefits. It just makes it really hard to have any kind of coherent dialogue around this question. Yet at the same time, country governments are trying to make very real policy around model sharing. So how can we cut through this polarized debate to just try and say, here’s the lay of the land. So what we’re trying to do in this paper is really provide a well-balanced, well-researched overview of both the risks and benefits of open-sourcing highly capable models.

By ‘highly capable models’, basically what we’re talking about is frontier AI systems. We just don’t use the term ‘frontier’ because the frontier moves, but we want to be clear that there will be some systems that might be highly capable enough and pose risks that are high enough that, even if they aren’t on the frontier, we should still be careful about open-sourcing them. So basically we’re talking about frontier models, but frontier keeps going. So we can get into the terminology debate a little bit more later. But basically we wanted to outline what are the risks and benefits of open-sourcing these increasingly highly capable models, but then not just to have yet another piece that says, well, here are the risks and here are the benefits. But then kind of cut through by saying ‘maybe there are also alternative ways that we can try to achieve some of the same benefits of open-sourcing, but at less risk.’

So the idea here is actually quite similar to the democratization paper: it’s to not just say ‘what is open-sourcing?’ and ‘is it good or bad?’ But to say ‘What are the specific goals of open-sourcing? Why do people say that open-sourcing is good, that it’s something we should be doing that should be defended, that should be preserved? So why is it we want to open source AI models? Then if we can be specific about why we want to open source, are there other methods that we can employ to try to achieve some of these same goals at less risk?’ These might be other model sharing methods like releasing a model behind API or doing a staged release, or it might be other more proactive measures like: let’s say one benefit of open-sourcing is that it can help accelerate research progress, both in developing more greater AI capabilities, but also promoting AI safety research.

Another way you can promote AI safety research is to dedicate a certain percentage of profits towards AI safety research or to have research collaborations. You can do these things without necessarily having to provide access for anybody to have access to the model. So this is what we were trying to do with this paper: really just say, okay, we’re going to give, first of all, just a very well-researched overview of both the risks and benefits. In fact, majority of the paper actually focuses on what are the benefits and trying to break down the benefits of open-sourcing and then really doing a deep dive into alternative methods or other things that can be done to help pursue these benefits where the risks might just be too high.

I think that the main claim that we do make in the paper with regard to ‘is open-sourcing good or bad?’ First of all, it’s a false dichotomy, but I think our main statement is open-sourcing is overwhelmingly good. Open source development of software, open source software underpins all of the technology we’re using today. It’s hugely important. It’s been great for technological development, for the development of safe technology that reflects lots of different user needs and interests. Great stuff. The issue is that there might be some cutting edge, highly capable AI systems that pose risks of malicious use or the proliferation of dangerous capabilities or even proliferation of harms and vulnerabilities to downstream applications.

Where these risks are extreme enough, we need to take a step back and say: maybe we should have some processes in place to decide whether or not open-sourcing these systems is okay. So I think the main thrust of the report is [that] open-sourcing is overarching good, there just may be cases in which it’s not the best decision and we need to be careful in those cases. So yeah, I think that’s the main overview of the paper.

Daniel Filan: Gotcha. Yeah, in case listeners were put off by the 60 something pages, I want listeners to know there is an executive summary. It’s pretty readable.

Elizabeth Seger: It’s a two-page executive summary. Don’t worry.

Daniel Filan: Yeah. I read the paper and I vouch for the executive summary being a fair summary of it. So don’t be intimidated, listeners.

Elizabeth Seger: I’ve also been putting off writing the blog post, so there will be a blog post version at some point, I promise.

Daniel Filan: Yeah. So it’s possible that might be done by the time we’ve released this, in which case we’ll link the blog post.

Elizabeth Seger: Maybe by Christmas.

Risks from open-sourcing

Daniel Filan: Yeah. Hopefully you’ll have a Christmas gift, listeners. So when we’re talking about these highly capable foundation models and basically about the risks of open-sourcing them… as I guess we’ve already gotten into, what is a highly capable foundation model is a little bit unclear.

Elizabeth Seger: Yeah. It’s already difficult.

Daniel Filan: Maybe it’ll be easier to say, what kinds of risks are we talking about? Because I think once we know the kinds of risks, (a) we know what sorts of things we should be willing to do to avert those risks or just what we should be thinking of, and (b) we can just say, okay, we’re just worried about AIs that could potentially cause those risks. So what are the risks?

Elizabeth Seger: In fact, I think framing it in terms of risks is really helpful. So generally here we’re thinking about risks of significant harm, significant societal or even physical harm or even economic harm. And of course now you say, “oh, define ‘significant’”. But basically just more catastrophic, extreme, significant societal risks and harms. So we use some examples in the paper looking at potential for malicious use.

There are some more diffuse harms if you’re thinking about political influence operations. So this kind of falls into the misinformation/disinformation discussion. How might AI systems be used to basically influence political campaigns? So disinformation undermines trust and political leaders. And then in turn, if you can disrupt information ecosystems and disintegrate the processes by which we exchange information or even just disintegrate trust in key information sources enough, that can also impact our ability as a society to respond to things like crises.

So the pandemic is a good example of, if it’s really hard to get people accurate information about, for example, whether or not mask-wearing is effective, you’re not going to have really good coordinated decision-making around mask-wearing. So I think this is one where it’s a little bit more diffuse. It’s harder to measure. So I think this is one point where it’s quite difficult to identify when the harm is happening because it’s so diffused. But I think this is one potential significant harm. I’d say maybe thinking about disrupting major political elections or something like that.

There’s some options for malicious use that are talked about a little more frequently, I think because they’re more well-defined and easier to wrap your head around. These are things like using generative AI to produce biological weapons or toxins, even production of malware to mount cyber attacks against key critical infrastructure. Imagine taking down an electrical grid, or imagine on election day, taking down an electoral system. These could have significant societal impacts in terms of harm to society or physical harm.

I think one key thing to point out here is that we aren’t necessarily seeing these capabilities already. Some people may disagree, but I think my opinion is that technologies with the potential for this kind of harm do not currently exist. I think it is wise to assume that they will come into existence in the not too distant future, largely because we’re seeing indications of these kinds of capabilities developing. We’ve already seen how even narrow AI systems that are used in drug discovery, if you flip that parameter that’s supposed to optimize for non-toxicity to optimize for toxicity, now you have a toxin generator. So it doesn’t take a huge stretch of the imagination to see how more capable systems could be used to cause quite significant societal harm. So no, I don’t think there’s currently systems that do this. I think we need to be prepared for a future in which these systems do exist. So it’s worth thinking about the wisdom of releasing these systems now even if we might not be present with the technology at this moment.

Daniel Filan: So as a follow-up question: you’re currently on the AI X-risk Research Podcast. I’m wondering… so it seems like you’re mostly thinking of things less severe than all humans dying or just permanently stunting the human trajectory.

Elizabeth Seger: I wouldn’t say necessarily. I think this is definitely a possibility. I think part of what’s happening is you kind of have to range how you talk about these things. I am concerned about x-risk from AI, and I think we could have systems where whether it’s the cyber capability or the biological terror capability or even the taking down political systems capability and bolstering authoritarian governments - these are things that could cause existential risk and I am personally worried about this. But [we’re] trying to write papers that are digestible to a much wider population who might not be totally bought into the x-risk arguments. I think, x-risk aside, there are still very genuine arguments for not necessarily releasing a model because of the harm that it can cause even if you don’t think that those are existential risks. Even if it’s just catastrophic harm, probably still not a great idea.

Daniel Filan: So at the very least, there’s a range going on there.

Elizabeth Seger: Yeah.

Should we make AI too dangerous to open source?

Daniel Filan: So one question I had in my mind, especially when thinking about AI that can cause these kinds of risks and potentially open-sourcing it… Part of me is thinking, it seems like a lot of the risks of open-sourcing seem to be, well, there’s now more people who can use this AI system to do a bunch of harm. The harm is just so great that that’s really scary and dangerous. So one thing I was wondering is: if it’s so dangerous for many people to have this AI technology, is it not also dangerous for anybody at all to have this AI technology? How big is this zone where it makes sense to make the AI but not to open source it? Or does this zone even exist?

Elizabeth Seger: Yeah, you make a very genuine point, and there are those that would argue that we should just hit the big old red stop button right now. I think that is an argument some people make. I guess where we’re coming from with respect to this paper is trying to be realistic about how AI development’s probably going to happen and try to inform governance decisions around what we should do as AI development continues. I think there are differing opinions on this, probably even among the authors on the paper. There’s what, 25, 26 authors on the paper.

Daniel Filan: The paper has a disclaimer, listeners, that not all authors necessarily endorse every claim in the paper.

Elizabeth Seger: That includes me. So we have a very, very broad group of authors. But I think… Sorry, what was your original question? It just flew out of my head.

Daniel Filan: If it’s too dangerous to open source, is it also too dangerous to make?

Elizabeth Seger: Yeah, no. So I think there’s definitely an argument for this. I guess where I’m coming from is trying to be realistic about where this development process is going and how to inform governments on what to do as AI development continues. It may very well be the case that we get to a point where everyone’s convinced that we just have a big coordinated pause/stop in AI development. But assuming that that’s not possible or improbable, shall we say, I think it’s still wise to have a contingency plan. How can we guide AI development to be safe and largely beneficial and reduce the potential for risk? I think there’s also an argument to be made that increasingly capable systems, while they pose increasingly severe risks, also could provide increasingly great benefits and potential economic potential benefits for helping to solve other challenges and crises that we face.

So I think you’d be hard-pressed to get a giant coordinated pause. So the next plan is how do we make AI development happen safely? It is a lot easier to keep people safe if the super dangerous ones are not spread widely. I guess that’s the very simplistic view. Yeah.

So I think, at least for me, I think it just comes from a realism about, are we likely to pause AI development before it gets super scary? Probably not, just given how humanity works. We developed nuclear weapons, probably shouldn’t have done that. Oops, well now they exist, let’s deal with it. So I think having a similar plan in line for what do we do with increasingly capable AI systems is important, especially given that it might not be that far away. Like I said, sort of with each new generation of AI that’s released, we all kind of hold our breaths and say, oh, well what’s that? What’s that one going to do? Then we learn from it and we don’t know what the next generation of AI systems is going to bring. And so having systems in place to scan for potential harms, potential dangers, capabilities, to inform decisions about whether or not these systems should be released and if and how they should be shared, that’s really important. We might not be able to coordinate a pause before the big scary happens. So, I think it’s important to discuss this regardless.

Daniel Filan: I got you.

Elizabeth Seger: That’s a technical term by the way, the big scary.

Offense-defense balance

Daniel Filan: Makes sense. So speaking of the risks of open-sourcing AI, in the paper you talk about the offense-defense balance, and basically you say that bad actors, they can disable misuse safeguards, they can increase new dangerous capabilities by fine-tuning. Open-sourcing makes this easier, it increases attacker knowledge. And in terms of just the AI technology, you basically make a claim that it’s tilted towards offense in that attackers can do… They get more knowledge, they can disable safeguards, they can introduce new dangerous capabilities. It’s easier to find these problems than it is to fix them. And once you fix them, it’s hard to make sure that everyone has the fixes. Would you say that’s a fair summary of-

Elizabeth Seger: Yeah, I’d say that’s pretty fair. I think the one thing I’d add… I’d say that with software development, for example… so offense-defense balance is something that’s often discussed in terms of open-sourcing and scientific publication, especially any time you have dual-use technology or scientific insights that could be used to cause harm, you kind of have to address this offense-defense balance. Is the information that’s going to be released going to help the bad actors do the bad things more or less than it’s going to help the good actors do the good things/prevent the bad actors from doing the bad things? And I think with software development, it’s often in favor of defense, in finding holes and fixing bugs and rolling out the fixes and making the technology better, safer, more robust. And these are genuine arguments in favor of why open-sourcing AI systems is valuable as well.

But I think especially with larger, more complex models, we start veering towards offense balance. I just want to emphasize, I think one of the main reasons for this has to do with how difficult the fixes are. So in the case of software, you get bugs and vulnerabilities that are relatively well-defined. Once you find them, relatively easy to fix, roll out the fix. The safety challenges that we’re facing with highly capable AI systems are quite complex. We have huge research teams around the world trying to figure them out, and it’s a very resource-intensive process, takes a lot of talent. Vulnerabilities are still easy to find, they’re just a lot harder to fix. So, I think this is a main reason why the offense-defense balance probably skews more towards offense and enabling malicious actors because you can still find the vulnerabilities, you can still manipulate the vulnerabilities, take advantage of them… harder to fix, even if you have more people involved. So, that’s the high-level evaluation. Mostly, I just wanted to push on the safety issue a little harder.

KataGo as a case study

Daniel Filan: Gotcha. So, one thing this actually brings up for me is… you may or may not be familiar, so there’s AI that plays the board game Go. There’s an open-source training run and some colleagues of mine have found basically you can cheaply find adversarial attacks. So, basically dumb Go bots that play in a way that confuses these computer AI policies. And these attacks, they’re not generally smart, but they just push the right buttons on the AI policy. And I guess this is more of a comment than a question, but there’s been enough rounds of back and forth between these authors and the people making these open source Go bots that it’s potentially interesting to just use that as a case study of the offense-defense balance.

Elizabeth Seger: So, you’re saying that given that the system was open source and that people could sort of use it and query it and then send the information back to the developers, that that’s-

Daniel Filan: Yeah, and in fact, these authors have been working with the developers of this software, and in fact what’s happened is the developers of this software - KataGo, if people want to google it - they’ve seen this paper, they’re trying to implement patches to fix these issues. The people finding the adversarial policies are basically checking if the fixes work and publishing the information about whether they do or not.

Elizabeth Seger: Absolutely, so I want to be really clear that this is a huge benefit of open source development: getting people involved in the development process, but also using the system, finding the bugs, finding the issues, feeding that information back to the developers. This is a huge benefit of open source development for software and AI systems. I think that this is specifically why the paper focuses on the big, highly capable frontier foundation models is that this gets more difficult the more big, complex, cutting edge the system is. Some bugs will still be relatively small, well-defined, and there are bug bounty programs, proposals for AI safety bounty programs as well, helping to find these vulnerabilities and give the information back to the developers. I think there are issues though, with respect to some of the larger safety issues that are more difficult. Sometimes it’s difficult to identify the safety problems in the first place, more difficult to address the safety problems.

Then, there’s also the issue of just rolling out the fixes to the system. So software development, you fix a bunch of bugs, you roll out an update. Oftentimes, in the license it’ll say that you’re supposed to actually use the updated version. There’s some data that came out, I can’t remember which organization it was right now, I’ll have to look it up later, but it’s actually quite a low uptake rate of people actually running the up-to-date software. So first of all, even with just normal software, it’s hard to guarantee that people are actually going to run the up-to-date version and roll out those fixes, which is an issue with open source because if you’re using something behind API, then you just update the system and then everyone’s using the updated system. If you have an open source system, then people actually have to download and run the update version themselves.

With foundation models, there’s actually a weird incentive structure that changes where people might actually be de-incentivized to update. So with software, oftentimes when you have an update, it fixes some bugs and it improves system functionality. When it comes to safety fixes for foundation models, oftentimes it has to do with reducing system functionality, like putting on a filter that says, “Well, now you can’t produce this class of images. Now, you can’t do this kind of function with the system,” so it’s hard, I don’t know if there’s good information on how this has actually panned out now. Are we seeing lower uptake rates with updates for AI systems? I don’t know, but there might be something weird with incentive structures going on too, where if updates basically equate to reducing system functionality in certain ways, people might be less likely to actually take them on board.

Daniel Filan: I don’t have a good feel for that.

Elizabeth Seger: I don’t have a super good feel, but just, I don’t know, interesting food for thought, perverse incentive structures.

Daniel Filan: I don’t know, I’m still thinking about this KataGo case: so that’s the case where the attack does reduce the system functionality and people are interested in getting the latest version with fixes. It also occurs to me that… So, in fact the structure of this paper, the way they found the attack did not rely on having access to the model weights. It relied on basically being able to query the Go bot policy, basically to try a bunch of things and figure out how to trick the Go bot policy. Now, it’s really helpful if you can have the weights locally just so that you can call the API, so that you can call it a lot, but that was not a case where you needed the actual weights to be shared. So on the one hand, that’s a point that sharing the weights is less valuable than you might think, but it also suggests if you’re worried about people finding these adversarial attacks, then just putting the weights behind an API doesn’t protect you as much as you think. Maybe you need to rate limit or something.

Elizabeth Seger: I think that’s a valuable insight, there are definitely things you can do without weights. This is an argument for why you should be worried anyway, but it’s also an argument for… There are lots of arguments for, well, open-sourcing is important because you need it to do safety research and getting more people involved in safety research will result in safer systems, you have more people input into these processes, but you just illustrated a perfect example of how just having query access, for example, to a system, can allow you to do a significant amount of safety research in terms of finding vulnerabilities.

Openness for interpretability research

Elizabeth Seger: So, query access is one that can be done completely behind an API, but then even if we think about something like interpretability research, interpretability research does require much more in-depth access to a system to do, but arguably this is an argument for needing access to smaller systems. We are struggling to do interpretability research on smaller, well-defined systems. It’s sort of like the rate limiting factor on interpretability research isn’t the size of the models people have access to, the way I understand it at least. If we’re struggling to do interpretability research on smaller models, I feel like having access to the biggest, most cutting edge frontier model is not what needs to happen to drive interpretability research.

Daniel Filan: I think it depends on the kind of interpretability research.

Elizabeth Seger: It’d be interesting to hear your thoughts on this as well, but there’s a range of different kinds of AI research. Not all of it requires open access and then some of the kinds that do require open access to the models isn’t necessarily helped the most by having the open access, and then there is also this idea of alternative approaches that we talk about in the paper. You can help promote AI safety research by providing access to specific research groups. There are other things you can do to give people the access they need to do safety research.

Daniel Filan: So, I guess I can share my impressions here. Interpretability research is a broad bucket. It describes a few different things. I think there are some kinds of things where you want to start small and we haven’t progressed that far beyond small. So just understanding, ‘can we just exhaustively understand how a certain neural net works?’ - start small, don’t start with GPT-4. But I think one thing you’re potentially interested in, in the interpretability context, is how things get different as you get bigger models, or do bigger models learn different things or do they learn more? What sorts of things start getting represented? Can we use interpretability to predict these shifts? There you do want bigger models. In terms of how much can you do without access to weights, there’s definitely a lot of interpretability work on these open source models because people apparently really do value having the weights.

Even in the case of the adversarial policies work I was just talking about, you don’t strictly need access to the weights, but if you could run the games of Go purely on your computer, rather than calling the API, waiting for your request to be sent across the internet, and the move to be sent back, and doing that a billion times or I don’t know the actual number, but it seems like just practically-

Elizabeth Seger: Much more efficient.

Daniel Filan: … It’s easier to have the model. I also think that there are intermediate things. So one thing the paper talks about, and I guess your colleague Toby Shevlane has talked about is basically structured access of giving certain kinds of information available maybe to certain people or maybe you just say “these types of information are available, these types aren’t”. I’ve heard colleagues say, “Even if you didn’t open source GPT-4 or GPT-3, just providing final layer activations or certain kinds of gradients could be useful”, which would not… I don’t think that would provide all the dangers or all the risks that open-sourcing could potentially involve.

Elizabeth Seger: I think this is a really key point as well, is trying to get past this open versus closed dichotomy. Just saying that something isn’t open source doesn’t necessarily mean that it’s completely closed and no-one can access it. So like you said, Toby Shevlane talks about structured access, and it was a paper we referenced - at least when we referenced it, it was still forthcoming, it might be out now - but it was Toby Shevlane and Ben Bucknell were working on it, and it was about the potential of developing research APIs. So, how much access can you provide behind API to enable safety research and what kind of access would that need to look like and how could those research APIs be regulated and who by? So, I think if there’s a genuine interest in promoting AI safety research and a genuine acknowledgement of the risks of open-sourcing, we could put a lot of resources into trying to develop and understand ways to get a lot of the benefits to safety research that open-sourcing would have by alternative means.

It won’t be perfect. By definition, it’s not completely open, but if we take the risks seriously, I think it’s definitely worth looking into these alternative model sharing methods and then also into the other kinds of proactive activities we can engage in to help promote safety research, whether that’s committing a certain amount of funds to safety research or developing international safety research organizations and collaborative efforts. I know one issue that always comes up when talking about, “Well, we’ll just provide safety research access through API, or we’ll provide privileged downloaded access to certain groups.” It’s like, “Well, who gets to decide who has access? Who gets to do the safety research?”

And so, I think this points to a need to have some sort of a multi-stakeholder governance body to mediate these decisions around who gets access to do the research, whether you’re talking about academic labs or other private labs, sort of like [how] you have multi-stakeholder organizations decide how to distribute grants to do environmental research, or you have grantmaking bodies that distribute grant funds to different academic groups. You could have a similar type situation for distributing access to more highly capable, potentially dangerous systems, to academic groups, research groups, safety research institutions that meet certain standards and that can help further this research.

So, I feel like if there’s a will to drive safety research forward, and if varying degrees of access are needed to allow the safety research to happen, there are things we can do to make it happen that do not necessarily require open-sourcing a system. And I think, like we said, different kinds of safety research require varying degrees of access. It’s not like all safety research can be done with little access. No, you need different amounts of access for different kinds of safety research, but if there’s a will, there’s a way.

Effectiveness of substitutes for open sourcing

Daniel Filan: So, I want to ask something a bit more quantitative about that. So, some of the benefits of open-sourcing can be gained by halfway measures or by structured access or pursuing tons of collaborations, but as you mentioned, it’s not going to be the same as if it were open sourced. Do you have a sense of… I guess it’s going to depend on how constrained you are by safety, but how much of the benefits of open source do you think you can get with these more limited sharing methods?

Elizabeth Seger: That’s a good question. I think you can get quite a bit, and I think, again, it sort of depends what kind of benefit you’re talking about. So in the paper, I think we discuss three different benefits. Let’s say we talk about accelerating AI research, so that’s safety research and capability research. We talk about distributing influence over AI systems, and this ranges everything from who gets to control the systems, who gets to make governance decisions about the systems, who gets to profit. It wraps all the democratization themes together under distributing influence over AI, and then, let’s see, what was the other one that we talked about? You’d think I’ve talked about this paper enough in the last three months, I have it down. Oh, external model evaluation. So, enabling external oversight and evaluation, and I think it depends which one you’re talking about.

Let’s start with external model evaluation. I think that this probably benefits the most from open-sourcing. It depends what you’re looking at, so for example, if you’re just looking for minor bugs and stuff like that, you don’t need open source access for that, but having a more in-depth view to the systems is more important for people trying to help find fixes to the bugs. We’ve discussed this. There are also risks associated with open-sourcing. If we’re talking about accelerating capability research, for example, which sort of falls under the second category, I think you might find that the benefits of open-sourcing here might be somewhat limited the larger and more highly capable the system gets. And I think this largely will just have to do with who has access to the necessary resources to really operate on the cutting edge of research and development. Open source development, it operates behind the frontier right now largely because of restrictions… not restrictions, but just the expense of the necessary compute resources.

And then you talk about distributing control over AI, we’ve already discussed the more distributed effect of open-sourcing and model sharing on distributing control. It’s a second order effect: you get more people involved in the development process and then large labs have more competition, and then it distributes influence and control.

There are probably more direct ways you can help distribute control and influence over AI besides making a system widely available. So, to answer your original question then about, how much of the benefit of open-sourcing can you get through alternative methods? I guess it really depends what benefit you’re talking about. I think for AI safety progress probably quite a bit, honestly; actually the vast majority of it, given that a lot of the safety research that’s done on these highly capable, cutting edge models is something that has to happen within well-resourced institutions anyway, or you need the access to the resources to do that, not just the code and the weights, but the computational resources and so on.

So, I think quite a bit. I think it’s less of a “can we get close to the same benefits that open-sourcing allows?” It’s more like, “can we do it in one fell swoop?” That’s the thing. It’s like open-sourcing is the easy option. “Here, it’s open!” - and now you get all these benefits from open-sourcing. The decision to open-source or not, part of the reason it’s a hard decision is because achieving these benefits by other means is harder. It’s going to take more resources to invest, more organizational capacity, more thought, more cooperation. It’s going to take a lot of infrastructure, a lot of effort. It’s not the one-stop shop that open-sourcing is, but I think the idea is that if the risks are high enough, if the risks are severe enough, it’s worth it. I think that’s where it comes in.

So, I guess it’s worth reiterating again and again: this paper is not an anti-open source paper, [it’s] very pro-open source in the vast majority of cases. What we really care about here are frontier AI systems that are starting to show the potential for causing really catastrophic harm, and in these cases, let’s not open-source and let’s pursue some of these other ways of achieving the same benefits of open source to safety and distributing control and model evaluation, but open-source away below that threshold. The net benefits are great.

Offense-defense balance, part 2

Daniel Filan: Gotcha. So my next question - I actually got a bit sidetracked and wanted to ask it earlier - so in terms of the offense-defense balance, in terms of the harms that you are worried about from open-sourcing, I sometimes hear the claim that basically, “Look, AI, if you open-source it, it is going to cause more harm, but you also enable more people to deal with the harm.” So I think there, they’re talking about offense-defense balance, not of finding flaws in AI models, but in the underlying issues that AI might cause. So, I guess the idea is something… To caricature it, it’s something like, “Look, if you use your AI to create a pathogen, I can use my AI to create a broad spectrum antibiotic or something”, and the hope is that in these domains where we’re worried about AI causing harm, look, just open-sourcing AI is going to enable tons of people to be able to deal with the harm more easily, as well as enabling people to cause harm. So I’m wondering, what do you think about the underlying offense-defense balance as opposed to within AI?

Elizabeth Seger: I get the argument. Personally, I’m wary about the arms race dynamic though. You gotta constantly build the stronger technology to keep the slightly less strong technology in check. I guess this comes back to that very original question you asked about, “What about just hitting the no more AI button?” So, I guess I get the argument for that. I think there’s weird dynamics, I don’t know. I’m not doing a very good job answering this question. I’m personally concerned about the race dynamic here, and I think it just comes back to this issue of, how hard is it to fix the issues and vulnerabilities in order to prevent the misuse in the first place? I think that should be the goal: preventing the misuse, preventing the harm in the first place. Not saying, “Can we build a bigger stick?”

There’s a similar argument that is brought up when people talk about the benefits of producing increasingly capable AI systems and saying, “Oh, well, we need to plow ahead and build increasingly capable AI systems because you never know, we’ll develop a system that’ll help cure cancer or develop some renewable energy technology that’ll help us address climate change or something like that.” What huge problems could AI help us solve in the future? And I don’t know - this is personally me, I don’t know what the other authors on this paper think of this - but I don’t know, I kind of feel like if those are the goals, if the goals are to solve climate change and cure cancer, take the billions upon billions upon billions and billions of dollars that [we’re] currently putting into training AI systems and go cure cancer and develop renewable technologies! I struggle with those arguments personally. I’d be interested just to hear your thoughts. I have not written about this. This is me riffing right now. So, I’d be interested to hear your thoughts on this train of thought as well.

Daniel Filan: I think the original question is unfairly hard to answer just because it’s asking about the offense-defense balance of any catastrophic problem AI might cause and it’s like, “Well, there are tons of those and it’s pretty hard to think about.” So, the thing you were saying about, if you wanted to cure cancer, maybe step one would not be “create incredibly smart AI”. I’ve seen this point. I don’t know if you know David Chapman’s Better without AI?

Elizabeth Seger: No, not familiar.

Daniel Filan: So, he basically argues we just shouldn’t build big neural nets and it’s going to be terrible. Also, Jeffrey Heninger at AI Impacts, I think has said something similar along these lines. On the one hand, I do kind of get it, just in the sense that, if I weren’t worried about misaligned AI, there’s this hope that this is the last invention you need. You create AI and now instead of having to separately solve cancer and climate change and whatever, just make it solve those things for you.

Elizabeth Seger: I guess it’s just really hard to look forward, and you have to decide now whether or not this technology is that silver bullet and how much investment it’s going to take to get to that point.

Daniel Filan: I think that’s right, and I think your take on this is going to be driven by your sense of the risk profile of building things that are significantly smarter than us. I guess from the fact that I made the AI X-risk Research Podcast, rather than the AI Everything’s Going to be Great Research Podcast, people can guess my-

Elizabeth Seger: It’s an indication of where you’re coming from.

Daniel Filan: … take on this, but I don’t know. I think it’s a hard question. So, part of my take is, in terms of the underlying offense-defense balance, I think it becomes more clear when you’re worried about, what should I say, silicogenic risks? Basically the AI itself coming up with issues rather than humans using AI to have nefarious schemes. Once you’re worried about AI doing things on their own where you are not necessarily in control, there I think it makes sense that you’re probably… If you’re worried about not being able to control the AIs, you’re probably not going to be able to solve the risks that the AIs are creating, right?

Elizabeth Seger: Yeah, your management plan for AI shouldn’t be to build a slightly more powerful AI to manage your AI.

Daniel Filan: Well, if you knew that you were going to remain in control of the slightly bigger AI, maybe that’s a plan, but you kind of want to know that.

Elizabeth Seger: I guess I was saying that if you’re worried about loss of control scenarios, then the solution shouldn’t be, “Well, let’s build another system that’s also out of our control, but just slightly better aligned to address the…” I feel like that’s-

Daniel Filan: It’s not the greatest. I think my colleague John Wentworth has some saying, “Releasing Mothra to contain Godzilla is not going to increase property values in Tokyo,” which is a cute little line. I don’t know, it’s a hard question. I think it’s hard to say anything very precise on the topic. I did want to go back to the offense-defense balance. So moving back a bit, a thing you said was something like, “Look, it’s probably better to just prevent threats from arising, than it is to have someone make a pathogen and then have everyone race to create an antibiotic or antiviral or whatever.” So, that’s one way in which everyone having really advanced AI… That’s one way that could look in order to deal with threats. I think another way does look a bit more like prevention. I don’t know, it’s also more dystopian sounding, but one thing that AI is good at is surveillance, right?

Elizabeth Seger: Yes.

Daniel Filan: Potentially, so you could imagine, “Look, we’re just going to open source AI and what we’re going to use the AI for is basically surveilling people to make sure the threats don’t occur.” So, maybe one version of this is you just really amp up wastewater [testing]. Somehow you use your AI to just look at the wastewater and see if any new pathogens are arising. It could look more like you have a bunch of AIs that can detect if other people are trying to use AI to create superweapons or whatever, and stop them before they do somehow.

Elizabeth Seger: The wastewater example, that sounds great. We should probably do that anyway. In terms of surveilling to see how people are using AI systems using AI, why not just have the AI systems be behind an API where people can use the systems for a variety of downstream tasks integrating through this API and then the people who control the API can just see how the system is being used? Even if it can be used for a vast majority of tasks, even if you were to take all the safety filters off, the advantage of the API is still that you can see how it’s being used. I don’t know, I feel like that’s-

Making open-sourcing safer?

Daniel Filan: That seems like a good argument. All right, so I guess another question I have is related to the frame of the report. So in the report you’re basically like, “open-sourcing has these benefits, but it also has these costs. What are ways of doing things other than open-sourcing that basically try and retain most of the benefits while getting rid of most of the costs?” You can imagine a parallel universe report where you say, “Okay, open-sourcing has these benefits, but it also has these costs. We’re still going to open source, but we’re going to do something differently in our open source plan that is going to retain benefits and reduce costs”, right? So one example of this is you open-source models, but you have some sort of watermarking or you have some sort of cryptographic backdoor that can stop models in their tracks or whatever. I’m wondering: why the frame of alternatives to open-sourcing rather than making open-sourcing better?

Elizabeth Seger: Very simple. I think making open-sourcing better is the harder question, technically more difficult. I mean, for example, say you have watermarking, part of the issue with watermarking to identify artificially generated images is making sure the watermarks stick. How do you make sure that they are irremovable if you are going to open… This is a really complex technical question: how do you develop a system that has watermarked images where that watermark is irremovable if you were to open source the system?

I’m not saying it’s undoable. I personally don’t have the technical background to comment very deeply on this. I have heard people talking about how possible this would be. It also depends how you watermark, right?

If you have just a line of inference code that says, “slap a watermark on this thing”, it could delete the line of inference code. If you’re to train the system on only watermarked images, well now you have to retrain the entire system to get it to do something else, which is very expensive. So again, I think it depends how you do it.

I was at a meeting last week where people were talking about, are there ways we could build in a mechanism into the chips that run the systems that say “if some bit of code is removed or changed in the system, then the chip burns up and won’t run the system”. I’m not saying this is impossible, but [a] really interesting technical question, really difficult, definitely beyond my area of expertise. But I think if this is an approach we can take and say, there are ways to be able to open source the system and get all the benefits of open-sourcing by just open-sourcing and still mitigate the risks, I think that’s great. I think it’s just a lot more difficult.

And there’s one aspect in which we do take the flip view in the report, and I think this is where we start talking about a staged release of models. You can undergo a staged release of a model where you put out a slightly smaller version of a model behind API, you study how it’s being used. Maybe you take a pause, analyze how it was used, what [are] the most common avenues of attack, if at all, [that] were being used to try and misuse the model.

And then you release a slightly larger model, [then] a slightly larger model. You do this iteratively, and if you do this process, as you get to a stage where it’s like, hey, we’ve been doing the staged release of this model for however many months and no problems, looking good, there’s no emergent capabilities that popped up that are making you worried. You didn’t have to implement a bunch of safety restrictions to get people to stop doing unsafe things - okay, open source. This is not a binary [of] it has to be completely open or completely closed. And I think this is one respect… If you were to take this flip view of “how can we open source, but do it in the safest way possible?” Just open source slowly, take some time to actually study the impacts. And it’s not like the only way to get a sense of how the system’s going to be used is to just open source it and see what happens. You could do a staged release and study what those impacts are.

Again, it won’t be perfect. You never know how it’s going to be used 10 years down the road once someone gets access to all the weights and stuff. But it is possible to study and get some sort of insight. And I think one of the nice things about staged release is if you start doing the staged release process and you realize that at each iterative step you are having to put in a bunch of safety filters, for example, to prevent people from doing really shady stuff, that’s probably a good indication that it’s not ready to be open sourced in its current form, because those are safety filters that will just immediately be reversed once open sourced. So I think you can learn a lot from that.

So I think that’s one way you can open source safely: find ways to actually study what the effects are before you open source, because that decision to open source is irreversible. And then I think the technical challenge of, are there ways we can have backstops, that we can technically build in irreversible, irremovable filters or watermarks or even just hardware challenges that we could implement - I think [those are] really interesting technical questions that I don’t know enough about, but… Go for it. That’d be a great world.

Daniel Filan: Yeah. If listeners are interested, this gets into some territory that I talked about with Scott Aaronson earlier this year. Yeah, I think the classic difficulties, at least say for watermarking… I read one paper that claims to be able to bake the watermark into the weights of the model. To be honest, I didn’t actually understand how that works.

Elizabeth Seger: I think it has to do with how the model’s trained. So the way I understand it is if you have a dataset of images that all have a watermark in that dataset, not watermark in the sense like you see on a $1 bill, but weird pixel stuff that the human eye can’t see. If all the images in the training dataset have that watermark, then all of the images it produces will have that watermark. In that case, it’s baked into the system because of how it was trained. So the only way to get rid of that watermark would be to retrain the system on images that don’t contain the watermark.

Daniel Filan: Yeah, that’s one possibility. So that’s going to be a lot rougher for applying to text models, of course, if you want to just train on the whole internet. I think I saw something that claimed to work even on cases where the dataset did not all have the watermark, but I didn’t really understand how it worked. But at any rate, the key issue with these sort of watermarking methods is as long as there’s one model that can basically paraphrase that does not have watermarking, then you can just take your watermark thing and basically launder it and get something that - if your paraphrasing model is good enough, you can create something that looks basically similar, it doesn’t have the watermark, and then it’s sad news. Yeah. And then-

Elizabeth Seger: Sorry, I was going to say there’s similar [things] in terms of how doing something with one model allows you to jailbreak another model. I mean this is what happened with the ‘Adversarial suffixes’ paper, where using a couple open source models, one which was Llama 2, and using the weights of those models, figuring out a way to basically just throw a seemingly random string of numbers at a large language model, and then with that seemingly random range of numbers before the prompt basically get the system to do whatever you want. Except while they figured out how to do that using the weights accessible from Llama 2, it worked on all the other large language models. So finding a way to jailbreak one model and using the weights and access to one model, that could bring up vulnerabilities in tons of others that aren’t open sourced as well. So I think that’s another roughly related somewhat to what we were just talking about point.

Daniel Filan: Yeah, I guess it brings up this high level thing of whatever governance method for AI you want, you want it to be robust to some small fraction of things breaking the rules. You don’t want the small fraction to poison the rest of the thing, which watermarking unfortunately has.

Yeah, I guess I wanted to say something brief about backdoors as well. So there is really a way of, at least in toy neural networks, and you can probably extend it to bigger neural networks, you really can introduce a backdoor that is cryptographically hard to detect. So one problem is, how do you actually use this to prevent AI harm is not totally obvious. And then there’s another issue of… I guess the second issue only comes up with super smart AI, but if you have a file on your computer that’s like, “I implanted a backdoor in this model, the backdoor is this input”, then it’s no longer cryptographically hard to find as long as somebody can break into your computer. Which hopefully is cryptographically hard, but I guess there are security vulnerabilities there.

So yeah, I wonder if you want to say a little bit about the safer ways to get the open source benefits. I’ve given you a chance to talk about them a little bit, but is there anything more you want to say about those?

Elizabeth Seger: I think, not really. I think the overarching point is, just as I said before, when the risks are high - and I think that’s key to remember, I’m not saying don’t open-source everything - when the risks are high, it is worth investing in seeing how else we can achieve the benefits of open-sourcing. Basically, if you’re not going to open-source because the risks are high, then look into these other options. It’s really about getting rid of this open versus closed dichotomy.

So many of the other options have to do with other options for sharing models, whether that’s structured access behind API, even research API access, gated download, staged release, and then also more proactive efforts. Proactive efforts which can actually also be combined with open-sourcing. They don’t have to be seen as an alternative to open-sourcing. So this is things like redistributing profits towards AI safety research or starting AI safety and bug bounty programs. Or even like we talked about with the democratization paper, thinking about how we can democratize decision-making around AI systems to help distribute influence over AI away from large labs, which is another argument for open-sourcing.

So yeah, I think that this is key: there are other efforts that can be put in place to achieve many of the same benefits of open-sourcing and when the risks are high, it’s worth really looking into these.

AI governance research

Daniel Filan: All right. Okay. So moving on, I want to talk a little bit more broadly about the field of AI governance research. So historically, this podcast is mostly focused on technical AI alignment research, and I imagine most listeners are more familiar with the technical side than with governance efforts.

Elizabeth Seger: In which case, I apologize for all my technical inaccuracies. One of the benefits of having 25 co-authors is that a lot of the technical questions I got to outsource.

The state of the field

Daniel Filan: Makes sense. Yeah, it’s good to be interdisciplinary. So this is kind of a broad question, but how is AI governance going? What’s the state of the field, if you can answer that briefly?

Elizabeth Seger: The state of the field of AI governance. Yeah. Okay. I’ll try and answer that briefly. It’s going well in that people are paying attention. In this respect, the release of ChatGPT I think was really great for AI governance because people, besides those of us already doing AI governance research, are really starting to see this as something valuable and important that needs to be talked about and [asking] questions around what role should governments play in regulating AI, if at all? How do we get this balance between governments and the developers? Who should be regulated with respect to different things? Do all of the responsibilities lie on the developers or is it on the deployers?

And all of these questions suddenly are coming to light and there’s more general interest in them. And so we’re seeing things like, the UK AI Summit is happening next week, [a] global AI summit, looking at AI safety, really concerned about catastrophic and existential risks, trying to understand what kind of global institutions should be in place to govern AI systems, to evaluate AI systems, to audit, to regulate.

And this is bringing in countries from all over the world. I think it’s something like 28 different countries are going to be at the UK AI Summit. You have the EU AI Act where it started a while ago looking at narrow AI systems, but now is taking on foundation models and frontier AI systems and looking at open source regulation. And this has really, over the last year, exploded into a global conversation.

So in that respect, AI governance is going well in that people are paying attention. It’s also very high stress because suddenly everyone’s paying attention. We have to do something. But I think there’s really genuine interest in getting this right, and I think that really bodes well. So I’m excited to see where this next year goes. Yeah, there’s talk about having this global AI summit and then making this a recurring series. And so I think it’s going well in the sense that people are paying attention and the wheels are starting to turn, and that’s cool.

Open questions

Daniel Filan: Okay. I guess related to that, what do you see as the most important open questions in the field?

Elizabeth Seger: In the field of AI governance? Okay. So I think one big one is compute governance, which my colleague, Lennart Heim, works on. This is just thinking about how compute is a lever for trying to regulate who is able to develop large models, even how compute should be distributed so that more people can distribute large models, but basically using compute as a lever to understand who has access to and who is able to develop different kinds of systems. So I think that’s a huge area of research with a lot of growing interest because compute’s one of the tangible things that we can actually control the flow of.

I think that the questions around model-sharing and open-sourcing are getting a lot of attention right now. Big open question, a lot of debate, like I said, it’s becoming really quite a polarized discussion, so it’s getting quite hard to cut through. But a lot of good groups [are] working on this, and I think a lot of interest in genuinely finding common ground to start working on this. I’ve had a couple situations where I’ve been in groups or workshops where we get people who are very pro-open source and other people who are just like, no, let’s just shut down the whole AI system right now, really both sides of the spectrum coming together. And we try and find a middle ground on, okay, where do we agree? Is there a point where we agree? And very often we can come to a point of agreement around the idea that there may be some AI system, some model that poses risks that are too extreme for that model to be responsibly open sourced.

And that might not sound like that extreme of a statement, but when you have people coming from such polarized views to agree on the fact that there may exist a model one day that should not be open source, that is a starting point and you can start the conversation from there. And every group I’ve been in so far has got to that point, and we can start working on that. So I think this model-sharing question is a big open question and lots of technical research needs to be done around benchmarking to decide, when are capabilities too dangerous?

Also around understanding what activities are actually possible given access to different combinations of model components. And that’s actually not entirely clear, and we need a much more fine-grained understanding of what you can actually do given different kinds of model, combinations of model components, in order not only to have safe standards for model release and really a fine-grained standard for model release, but also to protect the benefits of open-sourcing. You don’t want to just have a blanket “don’t release anything” if you can get a lot of good benefit out of releasing certain model components. So I think a lot of technical research has to go into this.

Anyway. So yeah, second point, I think model-sharing is a really big point of discussion right now. And then with the upcoming UK AI Summit, [there’s] quite a bit of discourse around what international governance structures should look like for AI, a lot of different proposed models. And yeah, it’ll be interesting to see what comes out of the summit. I don’t think they’re going to agree on anything amazing at the summit. It’s two days. But I think for me, a really great outcome of the summit would be, first, recognition from everyone that AI systems could pose really extreme risks. So just a recognition of the risks. And then second, a plan going forward, a plan for how we can start establishing international systems of governance and really structure out when are we going to come to what kinds of decisions and how can we start putting something together. So I think that those are probably three key open questions, and the international governance structure one is really big right now too, just given the upcoming summit.

Daniel Filan: And I guess unless we get the editing and transcription for this episode done unusually quickly, listeners, the UK AI Summit is almost definitely going to be in your past. So I guess listeners are in this interesting position of knowing how that all panned out in a way that we don’t. So that was open questions in the field broadly. I’m wondering for you personally as a researcher, what things are you most interested in looking at next?

Elizabeth Seger: Interesting. I mean, most of my life is taken up with follow-up on this open source report right now. So I definitely want to keep looking into questions around model-sharing and maybe setting responsible scaling policy, responsible model release policy. I’m not exactly sure. I think I’m in this place right now where I’m trying to feel out where the most important work needs to be done and whether the best place for me to do is to encourage other people to do certain kinds of work where I don’t necessarily have the expertise, like we were talking about, like needing more technical research into what is possible given access to different combinations of model components, or are there specific areas of research I could try to help lead in? Or whether really what needs to be done is just more organizational capacity around these issues.

So no, I am personally interested in keeping up with this model-sharing discussion. I think there’s a lot of interesting work that needs to be done here, and it’s a key thing that’s being considered within the discussions around international AI governance right now. Yeah, so sorry I don’t have as much of a clear cut answer there, but yeah, I’m still reeling from having published this report and then everything that’s coming off the back of it and just trying to feel out where’s the next most important, most impactful step, what work needs to be done. So I guess if any of your listeners have really hot takes on “oh, this is what you should do next”, I guess, please tell me. It’ll be very helpful.

Daniel Filan: How should they tell you if someone’s just heard that and they’re like, oh, I need to-

Elizabeth Seger: “I need to tell her now! She must know!” Yeah, so I mean, I have a website where you could find a lot of my contact information, or you can always find me on LinkedIn. I spend far too much time on LinkedIn these days. And also my email address happens to be on the open source report. So if you download the report, my email address is there.

Daniel Filan: What’s the URL of your website?

Elizabeth Seger: ElizabethSeger.com.

Distinctive governance issues of x-risk

Daniel Filan: All right. Okay. Getting back to talking about governance in general, I’m wondering… so I guess this is an x-risk-focused podcast. How, if at all, do you think governance research looks different when it’s driven by concerns about x-risk mitigation versus other concerns you could have about AI governance?

Elizabeth Seger: Well, that’s a good question. Let’s see. So the question is how does the governance research look different?

Daniel Filan: Yeah. What kinds of different questions might you focus on, or what kinds of different focuses would you have that would be driven by x-risk worries rather than by other things?

Elizabeth Seger: So this is something that I’ve had to think about a lot in my own research development because I did not come into this area of research from an x-risk background interest. I came into it… I mean, honestly, I started in bioethics and then moved from bioethics looking at AI systems in healthcare and have sort of moved over into the AI governance space over a very long PhD program. And so here I am.

But I would say one of the things that I’ve learned working in the space [that’s] more interested in long-term x-risk impacts of AI and trying to prevent x-risks is really paying attention to causal pathways and really trying to be very critical about how likely a potential pathway is to actually lead to a risk. I don’t know if I’m explaining this very well.

Maybe a better way of saying it’s like: if you have a hypothesis, or let’s say you’re worried about the impacts of AI systems on influence operations or impacting political campaigns, I find it really helpful to start from the hypothesis of, it won’t have an impact. And really just trying to understand how that might be wrong, as opposed to trying to start from “oh, AI is going to pose a massive bio threat, or it’s going to pose a massive threat to political operations” or something like that. And then almost trying to prove that conclusion.

Yeah, I don’t know, I start from the opposite point and then try and think about all the ways in which I could be wrong. And I think this is really important to do, especially when you’re doing x-risk research, whether it’s with respect to AI or some other form of x-risk. Because I think there are a lot of people that turn off when you start talking about existential risks, they think it’s too far out there, it’s not really relevant to the important questions that are impacting people today, the tangible things that people are already suffering. And so I think it’s really important to be very, very rigorous in your evaluations and have a very clear story of impact for why it is that you’re doing the research you’re doing and focusing on the issues that you’re doing. At least that’s been my experience, trying to transition into the space and work on these issues.

Technical research to help governance

Daniel Filan: Another question I have, related to my audience… So I think my audience, a lot of them are technical alignment researchers and there are various things they could do, and maybe they’re interested in, okay, what work could technical alignment people do that would make AI governance better? I’m wondering if you have thoughts on that question.

Elizabeth Seger: Okay. Technical alignment people, AI governance better. Yeah, I mean there’s a lot of work going on right now, especially within the UK government. We just set up the UK AI Task Force, a government institution doing a lot of model evals and alignment research. I think if you have the technical background in alignment research, you are very much needed in the governance space. There’s very often a disconnect between… I mean, I am also guilty of this. There’s a disconnect between the people doing the governance research and the people who have the experience with the technology and really know the ins and outs of the technology that’s being developed.

And I think if you have the inclination to work in an AI governance space and help bridge that gap, that would be incredibly valuable. And like I’ve already said, some of the more technical questions, even around open-sourcing, are things that I was very, very glad to have colleagues and co-authors on the paper who have worked for AI labs and stuff before and really knew what they were talking about and could advise and help write some of the more technical aspects of the report.

So I think if you have the inclination to work in the space, to get involved with governance efforts, or even maybe some of these government institutions that are starting to pop up that are working on the boundary of AI governance and technical research, that could be a really valuable place to contribute. So I think my 2 cents off the top of my brain would be help bridge that gap.

Daniel Filan: Okay. Great. So before we wrap up, I’m wondering if there’s anything that you wish I’d asked but that I didn’t?

Elizabeth Seger: Oh, that’s a good question. No, I don’t think so. I think we’ve covered a lot of good stuff. Yeah, thank you for having me on really. I’d say there’s nothing in particular. This has been great.

Following Elizabeth’s research

Daniel Filan: All right, so to wrap up then, if people are interested in following your research, following up on this podcast, how should they do that?

Elizabeth Seger: So I have my website, ElizabethSeger.com. It sort of outlines my different ongoing research projects, has a lot of publications on it. Also, GovAI’s website is a wealth of information [on] all things AI governance from all my great colleagues at GovAI and our affiliates. So really, yeah, there’s new research reports being put out almost every week, maybe every other week, but really high quality stuff. So you can find a lot of my work on the GovAI website or my current work and past work on my own website or find me on LinkedIn. Yeah, just happy to talk more.

Daniel Filan: All right, well thank you very much for being on the podcast.

Elizabeth Seger: Great, thank you.

Daniel Filan: This episode is edited by Jack Garrett and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund and Lightspeed Grants, along with patrons such as Tor Barstad, Alexey Malafeev, and Ben Weinstein-Raun. To read a transcript of this episode or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at [email protected].

Discuss

AXRP Episode 25 - Cooperative AI with Caspar Oesterheld

Published on October 3, 2023 9:50 PM GMT

YouTube link

Imagine a world where there are many powerful AI systems, working at cross purposes. You could suppose that different governments use AIs to manage their militaries, or simply that many powerful AIs have their own wills. At any rate, it seems valuable for them to be able to cooperatively work together and minimize pointless conflict. How do we ensure that AIs behave this way - and what do we need to learn about how rational agents interact to make that more clear? In this episode, I’ll be speaking with Caspar Oesterheld about some of his research on this very topic.

Topics we discuss:

Daniel Filan: Hello, everybody. In this episode, I’ll be speaking with Caspar Oesterheld. Caspar is a PhD student at Carnegie Mellon where he’s studying with Vincent Conitzer. He’s also the assistant director of the Foundations of Cooperative AI Lab or FOCAL. For links to what we’re discussing, you can check the description of this episode and you can read the transcript at axrp.net. All right. So welcome to the podcast, Caspar.

Caspar Oesterheld: Thanks. Happy to be on.

Cooperative AI

Daniel Filan: Yeah. So my impression is that the thing tying together the various strands of research you’re involved in is something roughly along the lines of cooperative AI. Is that fair to say?

Caspar Oesterheld: I think that’s fair to say. I do some work other than cooperative AI and cooperative AI can, I guess, mean many things to different people. But generally, I’m happy with that characterization.

Daniel Filan: Okay. So I guess to the extent that cooperative AI covers most of your work, what does that mean to you?

Caspar Oesterheld: So to me, the most central problem of cooperative AI is a situation where two different human parties, like two companies or governments, each builds their own AI system and then these two AI systems potentially – while still interacting with their creators – these two AI systems interact with each other in some kind of general-sum, as game theorists would say, mixed-motive setting, where there are opportunities for cooperation but also perhaps a potential for conflict. And cooperative AI as I view it, or the most central cooperative AI setting or question, is how to make these kinds of interactions go well.

Daniel Filan: Okay. So I guess this is the AI X-risk Research Podcast, and I also at least perceive you as being part of this x-risk research community. I think if you just say that, many people think, “Okay, is this a matter of life or death, or is this just it would be nice to have a little bit more kumbaya going on?” So how relevant to x-risk is cooperative AI?

Caspar Oesterheld: Yeah, I certainly view it as relevant to x-risk. Certainly that’s most of my motivation for working on this. So I suppose that there are different kinds of interactions between these different AI systems that one can think about, and I guess some of them aren’t so high stakes and it’s more about just having some more kumbaya. And meanwhile, other interactions might be very high stakes. Like if governments make their decisions in part by using AI systems, then the conflicts between governments could be a pretty big deal, and I don’t know, could pose an existential risk.

Daniel Filan: Can you flesh that out? What would an example be of governments with AIs having some sort of mixed-sum interaction where one of the plausible outcomes is doom, basically?

Caspar Oesterheld: So I suppose the most straightforward example would just be to take an interaction that already exists between governments and then say, “Well, you could have this, and well, also, the decisions are made in part between AI.” So I suppose there are various disputes between different countries, obviously different governments. Usually it’s over territory, I suppose. And sometimes as part of these disputes, countries are considering the use of nuclear weapons or threatening to use nuclear weapons or something like that.

So I guess the current, maybe most salient scenario is that the US and Russia disagree over what should happen with Ukraine, whether it should be its own country or to what extent it should be able to make various decisions about whether to join NATO or the EU or whatever. And as part of that, Russia has brought up or Putin has brought up the possibility of using nuclear weapons. I tend to think that [in] this particular scenario it’s not that likely that nuclear weapons would be used, but in the past, during the Cuban missile crisis or whatever, it seemed more plausible. And I suppose the most straightforward answer is just, well, we could have exactly these kinds of conflicts, just also with AI making some of the decisions.

Daniel Filan: Okay. And so how does AI change the picture? Why aren’t you just studying cooperative game theory or something?

Caspar Oesterheld: Yeah, good question. Okay. AI might introduce its own new cooperation problems. So you could have these AI arms races where maybe even once one has the first human-level AI systems or something like that, there might still be a race between different countries to improve these systems as fast as possible and perhaps take some risks in terms of building underlying systems in order to have the best systems. So that would introduce some new settings, but mostly I guess what I think about is just that there are game -theoretic dynamics or game-theoretic questions that are somewhat specific to AI. So for example, how to train AI systems to learn good equilibria or something like that. It’s an example that’s very natural to ask in the context of building machine learning systems and a bit less of a natural question if we think of humans who already have some sense of how to play in these strategic situations.

Cooperative AI vs standard game theory

Daniel Filan: Okay. A related question is: when I look at cooperative AI literature, it seems like it’s usually taking game theory and tweaking it or applying it in a different situation. And I think game theory… there are a bunch of problems with it that made me think that it’s a little bit overrated. I have this list. So at least according to Daniel Ellsberg, apparently it wasn’t particularly used by the US in order to figure out how to do nuclear war, which is relevant because that’s why they invented it. It’s not obviously super predictive. You often have these multiple equilibria problems where if there are multiple equilibria and game theory says you play in equilibrium, that limits the predictive power. It’s computationally hard to find these, especially if your solution concept is Nash equilibrium. It’s hard to figure out how to get to an equilibrium. It seems like there’s important things that it doesn’t model like Schelling or really salient options. I don’t know. It seems very simplified and it also seems difficult to account for communication in a really nice, satisfactory way in the game theory land.

So I guess I’m wondering to what extent is cooperative AI fixing the problems I’m going to have with game theory versus inheriting them?

Caspar Oesterheld: Yeah, that’s a good question. I think there are a number of ways in which it tries to fix them. So for example with respect to equilibrium selection, that’s a problem that I think about very explicitly. And then also in the cooperative AI setting or in the setting of building learning agents, it comes up in a more direct way. So with humans, if you have this descriptive attitude that you use game theory to predict how humans behave, you can just say, “Well, they’ll play Nash equilibria.” “Well, which ones?” “Well, anything could happen, it depends on what the humans decide to do.” And with AI, you’re forced a bit more into this prescriptive position of having to actually come up with ways to decide what to actually do. So you can’t just say, “Well, you can do this. You can do that. It’s up to you.” At some point you’ll have to say, “Okay, we want to use this learning scheme,” or something like that.

And I guess there are also some other things where with some of your complaints about game theory, I would imagine that lots of people, including game theorists, would agree that these are problems, but where they just seem fundamentally quite difficult. So this problem of people going for different equilibria, like the equilibrium selection problem for example: I think part of why there isn’t that much work on it or why people generally don’t work on this that much is that it seems intractable. It seems very difficult to say much more than, “Well, there are different equilibria depending on blah, blah, blah. Different things might happen.” It seems very difficult to go beyond that. And I think similarly with Schelling points, these natural coordination points or focal points as they’re sometimes called…I’m actually quite interested in that topic, but I guess there too, I would imagine that lots of game theorists would say, “Yeah, this is… just hard to say anything about it, basically.” That’s why less has been said about those.

And I think this computational hardness issue, that Nash equilibria are hard to find so people won’t actually play Nash equilibrium, they’ll do something else that’s probably a bit in the spirit of Nash equilibrium. Still they’ll try to best respond to what their opponent is doing or something like that, but probably they won’t exactly get it right. I think there the issue too is that describing these kinds of dynamics is just much, much harder than the simple Nash equilibrium model. And perhaps game theorists would, and perhaps I also would, think that it’s still useful to think about Nash equilibrium, and maybe that’s something you would disagree with or you would have a somewhat different view about.

Daniel Filan: Well, I certainly think it’s useful to think about. I’m certainly sympathetic to the idea “well, these are hard to study so we’re studying easier problems.” But I just want to check… Should I just believe this stuff or is it simplifying things in order to make any headway? I guess one question I had is: so you mentioned for a few of these things, it’s just hard to study. And a natural question is: do things get easier or harder to study in the cooperative AI setting where you have a few more tools available at your disposal?

Caspar Oesterheld: I think it’s a mixed bag. I think some things get harder because, especially if we’re thinking about future AI systems, we have less direct access to how they’re thinking about things. For example, Schelling points and equilibrium selection. If I ask you, I don’t know… Well, when we walk into each other, when we walk towards each other on the road and we have to decide who goes right and who goes left or whether to go to the right or to go to the left, how do we decide? There’s a very simple answer that we know because we’re in the US or something. And in the US, there’s right-hand traffic so we’re more likely to go to the right.

Daniel Filan: That rule fails way more than I think I would’ve guessed naively. As far as I can tell, the actual solution is people randomly guess one direction and then if you collide, you switch with an ever-decreasing probability and that seems to pan out somehow.

Caspar Oesterheld: Which, I think is the optimal symmetric solution is to randomize uniformly until you anti-coordinate successfully.

Daniel Filan: Yeah. Although I think there’s more alternation than just repeated randomizing uniformly. So, I don’t know. I feel like more than average, I end up in these situations where we’re on the same side and then we both switch to the same side and then we both switch to the same side… You know what I mean? I think there’s a bias towards alternation rather than just randomization. This is a little bit off-topic though. Or maybe it doesn’t-

Caspar Oesterheld: It seems plausibly true. But even so, even if what people are doing is irrational, I don’t know, we have some kind of intuition for what one does. I guess you have. I didn’t. I think you’re basically right about what people are doing. But you just think, “Okay. Well, we do this and that’s how we solve this.” And I don’t know, maybe we can come up with other examples where it’s more straightforward that people just follow some convention and it’s clear that people are following this convention because there is this convention. And so with AI, it’s harder to guess, especially if you imagine AI systems that haven’t learned to play against each other, face each other for the first time on some problem with equilibrium selection. It seems harder to say what they’re going to do.

On the other hand, there are also some things that I think are easier to reason about. Maybe the rational agent model is more accurate about the kind of AI systems that we worry about than about humans. Also, if we think more from this prescriptive perspective of trying to figure out how to make things go well, then there are a lot of tricks that apply to AI more than to humans. So I guess we’ll talk a bit about some of my work later that’s in part about some of these things that seem to apply more to AI than to humans. For example, it’s easier to imagine very credible commitments because the AI can be given any goals, we can, I don’t know, change its source code in various ways. Whereas with humans that’s harder to do.

Another way in which the situation might be harder for AI is that with humans - this is related to what we already talked about - humans have already trained a bunch against each other. Probably some kind of group selection type or selection on a convention level has occurred. So successful conventions or successful ways of selecting equilibria have been selected [over] conventions that fail somehow. And this more evolutionary-style learning seems less realistic for AI at least. It’s very different from gradient descent or other contemporary techniques.

Daniel Filan: It seems like there’s this general issue where on the one hand, the AIs can do more stuff. They can be given various types of source code, they can read other AIs’ source codes… On the one hand, that gives you more tools to solve problems, but then on the other hand presumably it just adds to the strategic complexity. So I guess a priori, it should be unclear if that makes things easier or harder.

Caspar Oesterheld: Yeah. Even some of the things that I described on the “making things easier” side, more opportunities to get good equilibria often also imply just more equilibria, which typically makes things harder.

Daniel Filan: I wonder: if I just think about this, a lot of these traits of AI can sort of be implemented for humans. I think in one of the papers we’re going to be talking about, about safe Pareto improvements, it’s basically about the setting where you give an AI the task to solve a game but you can change the game a little bit. And presumably, we could do that with humans. Your company has to play some game and it just outsources it to the game-playing department of your company and maybe your company can give instructions to the game-playing department. I’m wondering, have these sorts of questions been studied much in the setting of humans, or do you think AI is jolting people to think thoughts that they could have in principle thought earlier?

Caspar Oesterheld: Yeah. So with some of these, that’s definitely the case. Some forms of credible commitment are discussed in game theory more generally. And I also agree that this transparent source code or partially transparent source code type setting is not that bad as a model of some interactions between humans. Organizations, for example: human organizations are somewhat transparent. To some extent, we can predict what US Congress will do. It’s not some black box that has some utility function, that it’s going to try to best respond to something with respect to this utility functions. They have lots of rules for what they’re going to do. We can ask the individual congresspeople what they feel about certain issues. So to some extent, human organizations are a bit like, I don’t know, they have specified rules or constitution and so on. So in some sense, they also play this open source type game, or this game where your source code is somewhat transparent.

I think with a lot of these things, part of the reason why they aren’t studied in this way is that, for example, with this open source type model, traditionally that is, it’s just that it’s a bit less of a natural setting and it’s clear that the human setting is just very fuzzy in some sense. And I think the AI setting will actually also be very fuzzy in a similar sense, but it’s easier to imagine this extreme case of being able to read one another’s source code or something like that. Whereas for human organizations, it’s less natural to imagine this extreme case or this particular model of source code. But yeah, I definitely agree that some of these things one can consider in the human context as well.

Daniel Filan: Yeah. And it’s close to models where players have types and you can observe other types and everyone of a type plays the same, right?

Caspar Oesterheld: Yeah, right. There’s also that kind of model.

Do we need cooperative AI if we get alignment?

Daniel Filan: Yeah. So the final framing question I want to ask is: I think when a lot of people encounter this line of research, they think, “Okay. Well, we already have AI alignment of, AI is trying to adopt ‘do what people want them to do’.” If I’m, I don’t know, Boeing or the US or something, and I’m making AI that’s aligned with me, I want it to be able to figure out how to play games such that we don’t end up in some terrible equilibrium where the world gets nuked anyway. So I think a lot of people have this intuition, “Well, we should just make really smart AI that does what we want and then just ask it to solve all of these cooperative AI questions.” And I’m wondering: what do you think about that plan or that intuition?

Caspar Oesterheld: Okay. I think there are multiple questions here that I think are all very important to ask and that to some extent I also think are important objections to a lot of work that one might do. I guess one version of this is that if we build aligned AI, whatever problems we have, we can just ask the AI to solve those problems. And if the AI is just good at doing research on problems in general because it’s generally intelligent or something like that, then we should expect it to also be able to solve any particular problem that we are concerned about.

Daniel Filan: Or there’s a weaker claim you could make, which is it can solve those problems as well as we could solve them if we tried.

Caspar Oesterheld: Right, yeah.

Daniel Filan: Sorry, I interrupted.

Caspar Oesterheld: No, that’s a good point. Obviously, that’s the more important claim. And I think that is an important consideration for cooperative AI but also for a lot of other types of research that one might naively think it is valuable to do. I think there are lots of specific things where I think this really works extremely well. If you think about solving some specific computational problem, like developing better algorithms for finding the good Nash equilibria in a three-player, normal-form game or something like that, it seems very likely that if we get, or once we get human-level AI or superhuman level AI, it will just be better at developing these algorithms, or at least as good as humans at developing these algorithms. Because to some extent, it seems like a fairly generic task.

I would imagine that an AI that’s generally good at solving technical problems will be good at this particular problem. What I think is an important property of the kinds of problems that I usually think about or the kinds of ideas that I usually think about is that they’re much less well-defined technical problems and they’re much more conceptual. I guess we’ll be talking about some of these things later, but they seem much less like the kinds of problems where you just specify some issue to the AI system and then it just gives the correct answer.

And I also think that, and this goes maybe more towards a different version of this objection, but I think another important claim here is that in some sense, game theory itself or strategic interaction between multiple agents is a special thing in a way, in a way that’s similar to how alignment or having the kinds of values that we want it to have is special in that it’s to some extent orthogonal to other kinds of capabilities that one would naturally train a system to have. So in particular, I think you can be a very capable agent and still land in bad equilibria in some sense. A bad Nash equilibrium is an equilibrium where everyone behaves perfectly rational, holding fixed [what] their opponent [does], but the outcome is still bad. And so at least if you imagine that training mostly consists of making agents good at responding to their environment, then landing in a bad Nash equilibrium is completely compatible with being super competent at best responding to the environment.

Daniel Filan: Yeah. I guess maybe this goes to some underlying intuition of game theory being a deficient abstraction because, I don’t know, in a lot of these situations, in a lot of these difficult game theory things. Like the prisoner’s dilemma, where it’s better for us if we cooperate, but whatever the other person does, I’d rather defect, but if we both defect, that’s worse than if we both cooperate: a lot of the time I’m like, “Well, we should just talk it out”; or we should just build an enforcement mechanism and just change the game to one where it actually does work. And maybe this just ends up as… If these options are available to powerful AIs then somehow the game theory aspect maybe or the multiple Nash equilibria thing is maybe less relevant. So I don’t know, maybe this comes down to a rejection of the game theory frame that one might have.

Caspar Oesterheld: I’m not sure I understand this particular objection. By default, the normal prisoner’s dilemma, single shot without repetition, without being able to set up enforcement mechanisms, make incredible commitments and these kinds of things… it just has one Nash equilibrium, right?

Daniel Filan: Yeah. I think the complaint is: in real life, we just are going to be able to communicate and set up enforcement mechanisms and stuff.

Caspar Oesterheld: Yeah, I think I mostly agree with that. But the enforcement mechanisms are what would in the prisoner’s dilemma introduce multiple Nash equilibria, right?

Daniel Filan: Yeah.

Caspar Oesterheld: I guess in the prisoner’s dilemma, it’s natural what the good equilibrium is because the game is symmetric so you just could do the Pareto optimal symmetric equilibrium. So assuming you can make a cooperate-cooperate Nash equilibrium by playing the game repeatedly or having enforcement mechanisms and so on - then it seems very natural that that’s what you would do. But if we ignore the symmetry intuition or we just take some game that is somewhat more asymmetric, then it seems like even with these enforcement mechanisms, there might be many possible equilibria that are differently good for different players. And yeah, I don’t see how you would avoid the multiple Nash equilibria issue in that kind of setting.

Daniel Filan: Yeah. I guess the hope is that you avoid the issue of really bad Nash equilibria. But I guess you still have this to the degree that there’s some asymmetry or anti-symmetry or something. I guess you have this issue of there being multiple equilibria that you’ve got to bargain between. If we both have to coordinate to do a thing and I like one of the things more and you like one of the things more and somehow we can’t randomize or something, I guess in that situation we’ve got this bargaining game of “okay, are we all going to have this rule that says we do the thing you like? Or are we all going to have this rule that says we do the thing I like?” And it ends up being a metagame, I guess.

Caspar Oesterheld: Yeah, I think I’m more worried about the multiple equilibrium thing than about getting the good equilibrium if there’s a unique good equilibrium. If you think of something like the stag hunt or a prisoner’s dilemma where all you really need to do is, I don’t know, pay 10 cents to set up this commitment mechanism that allows you to play cooperate-cooperate in equilibrium, I’m also much more optimistic about that. It’s still funny because in some sense there’s not really that great of a theory for why the good thing should happen, but it seems intuitive that it would happen.

And I think maybe this is also one of the… one way in which game theory is weird that even this very basic thing where it feels like you can just talk it out and say like, “Yeah, we are going to do this good thing,” and so on. Game theory doesn’t really have a model for this “talking it out” thing to get the good equilibrium. Really, that doesn’t seem that difficult to do. But yeah, the case that I’m much more worried about is the case where you have lots of different equilibria and the game is very asymmetric and so it’s just completely unclear which of the equilibria to go for.

Cooperative AI and agent foundations

Daniel Filan: Okay. I said that was my final question. But I actually have a final bridging question before I get to the next thing we’re going to talk about, which is: I think a lot of cooperative AI stuff, it often seems adjacent to this agent foundations line of work where people are really interested in: you’re an AI and there are other players in the world and they’re modeling you and you’re modeling them, but you can’t perfectly model something that’s perfectly modeling you because your brain can’t properly fit inside your brain. And people in this world are often interested in different decision theories or ways of thinking about how to be uncertain when you’re computationally limited. A lot of this kind of thing seems like it shows up in the cooperative AI literature, I guess specifically your research group’s cooperative AI literature. I’m wondering: why do you think that overlap is there and what do you think are the most important reasons driving that overlap?

Caspar Oesterheld: Yeah, that’s another very interesting question. I should start by saying that there are also lots of people who work on cooperative AI and who aren’t thinking that much about these more foundational questions. I don’t know, [they’re] more on the side of developing machine learning algorithms or setups for machine learning such that different machine learning agents just empirically are more likely to converge to equilibria or to better equilibria and things like that. But yeah, I agree that with respect to me in particular, there’s a lot of overlap. Okay, I think there are a couple of reasons. One reason is that it’s good to do things that are useful for multiple reasons. So if you can do something that’s useful both for analyzing the multi-agent stuff and also good for this more agents foundations perspective, that’s nice.

I think the deeper reason for why there’s overlap has to do with a lot of just the object-level issues of game theory and how they interact with these things. For example, it seems that this issue of not being able to fully model your environment, it comes up very naturally in game theory because basically if you have two rational agents, there’s kind of no way to really assume that they can both perfectly have a model of the other player or something like that. And they can best respond to each other, but these kind of assumptions…

In the case of mere empirical uncertainty, you can at least theoretically make these kind of assumptions, that you have your Bayesian prior or something like that and you just do Bayesian updating and then you converge on the correct world model. But everyone knows that’s not actually how it works, but you can easily think about this, and so you just think about this even though actually the real systems that actually do stuff in the real world don’t really work this way, but you can have this model. Whereas in game theory, it’s much harder to avoid these kinds of issues.

Daniel Filan: But there… I mean, if that was it, I would expect that there would be this whole field of game theory AI and they would all be interested in agent foundations stuff, but somehow it seems like the cooperative AI angle… I don’t know, maybe I just haven’t seen the anti-cooperative AI agent foundations work.

Caspar Oesterheld: Yeah, so there is actually lots of work that is on these kind of foundational questions and that isn’t particularly motivated by foundations of cooperative AI. So there’s this whole “regret learning” literature, which is… It’s a huge literature about some theoretical model of learning in games (or learning in general, but you can in particular apply it to learning in games). So it’s a model of rationality. You can use it as a model of bounded rationality. And then there are many, many papers about what happens if you have two regret learners play against each other, how quickly do they converge to what kind of solution concept? Do they converge to Nash equilibrium or coarse correlated equilibrium or whatever? So I think this literature basically just does exist.

Daniel Filan: Okay.

Caspar Oesterheld: I think most of this is a bit less motivated by AI, but yeah, there’s definitely a lot of foundational work on the intersection of what’s a rational agent, what should a rational agent do, or how should one learn stuff, and multi-agent interactions, that isn’t about cooperation.

A Theory of Bounded Inductive Rationality

Daniel Filan: Okay, cool. Well, I guess bridging from that, I guess I’d like to talk about your paper, A Theory of Bounded Inductive Rationality. So this is by yourself, Abram Demski, and Vincent Conitzer. Could you give us a brief overview of what this paper is about?

Caspar Oesterheld: Sure. So this paper considers a setting, which we for now might imagine is just a single agent setting where you have a single agent and it has to learn to make decisions. And the way it works is that every day, or I don’t know, at each time step, as you might say, it faces some set of available options and it can choose one of these options. So you might imagine, every day someone comes to you and puts five boxes on your table with different descriptions on them and then you have to pick one of them. And then once you pick a box, you get a reward for the box that you pick and you only observe that reward. You don’t observe any kind of counterfactuals. You don’t observe what would’ve happened had you taken another box. And in fact, it might not even be well-defined what would’ve happened, like what are these counterfactuals anyway? So all that happens, basically, you can choose between different things. You choose one of the things, you get a reward.

And this happens every day or every time step. And now the question is: how should you learn in this kind of setting? How should you learn to make choices that maximize reward? And in particular, we consider a myopic setting, which means we just want to maximize the short-term reward, like the reward that we get right now with the next box that we pick. I don’t know, maybe at some point we figure out that whenever you take the blue box, maybe you get a high reward, but then for the next 1,000 times steps, the reward will be low. The basic setting that the paper considers is such that you then still want to take the blue box. You don’t care about this losing the reward on the next 1,000 days, though you can consider alternatives to this.

Okay, so this is the basic setting of learning and this is very similar to this bandit literature where regret minimization is kind of the main criterion of rationality that people consider. So it’s very similar to that. The main difference is that we-

Daniel Filan: Wait, before you go on, what is regret minimization?

Caspar Oesterheld: Oh yeah, good question. I should explain that. So here’s one notion of what it means to do well in this type of scenario. I don’t know, let’s say you play this for 1,000 rounds and then in the end you ask yourself, “How much better could I have done had I done something else?” So for example, how much better could I have done had I only taken the - let’s say every day there’s a blue box - had I always just taken the blue box? And so the regret is kind of like: how much worse are you doing relative to the best thing you could have done in retrospect? And usually there are constraints on what the best thing in retrospect is: usually it’s just not achievable to have low regret relative to on each day picking the best thing.

Okay, I’ll just introduce one more aspect of the setting because that’s also important for our paper. So for example, you could have some set of experts, that you might imagine is just some set of algorithms, and on each day each expert has a recommendation for what you should do. Then your regret might be, “How much worse did I do relative to the best expert?” So after 1,000 days, you look at each expert and you ask, “How much [better] would I have done had I chosen at each time step what this expert recommended?”

Daniel Filan: And so my understanding is: the reason we’re saying something like this rather than just “in retrospect, the best thing you could have done ever” is, if the results are random or unknowable or something, in retrospect we could just pick the winning lottery numbers and stuff, but you don’t want to allow those sorts of strategies, is my understanding.

Caspar Oesterheld: Yeah, exactly. Very good. Yeah, thanks for explaining this. Yeah, I think that is exactly the justification for doing that.

Daniel Filan: Okay.

Caspar Oesterheld: Okay, so minimizing regret means minimizing regret with respect to the best expert in retrospect. And the intuition here from a learning and rational agent perspective is that, I don’t know, you’re some agent, you have limited computational abilities, and let’s say that this set of experts is the set of algorithms that you can compute or something like that, or you can compute all simple strategies for what you should do. And then low regret means that there’s no… In this set, there’s no strategy that does much better than what you do. And in particular, usually what people talk about is sublinear regret, which means that as time goes to infinity, your average regret per round goes to zero, which in some sense means that in the limit you learn to do at least as well as the best expert.

So one very simple version of this is: imagine you play some game, you play chess or something like that, and there are 10 people in the room and they each make recommendations for what you should play, and you don’t know anything about chess. And now one of the people in the room is a chess grandmaster and the other ones are just random non-chess-grandmasters who are maybe much worse, or a bit worse. Then it seems like if you learn in a reasonable way, you should learn to do at least as well as the grandmaster, right? Because the least thing you can learn is just do whatever this grandmaster does, right?

Daniel Filan: Yep.

Caspar Oesterheld: That’s kind of the intuition here.

Daniel Filan: But in your paper, you’re contrasting yourself with regret minimization.

Caspar Oesterheld: Yeah.

Daniel Filan: So what’s so bad about regret minimization? Isn’t that just equivalent to ‘do as well as you can’?

Caspar Oesterheld: Yeah, I think regret minimization is certainly a compelling criterion. I think in the agent foundations, existential risk community, I think too few people are aware of regret minimization and how it’s a very simple criterion and so on. Okay, now I want to make the case against this. So why don’t I find this very satisfying? So it has to do with the way that regret minimization reasons about counterfactuals. So an immediate question that you might have about regret minimization is, “Is this even ever achievable?” Can you even ensure that you have low regret or sublinear regret? Because you might imagine a setting where the environment exactly computes what you do and always does the opposite, right?

Like, on every day you have two boxes in front of you and one of them contains a dollar and the other one doesn’t. And the way the environment fills the boxes is that it predicts what you do and then it puts the dollar in the other box. And in this kind of setting you might imagine it seems really hard to achieve low regret here so it will be difficult to, in retrospect, not do worse than, for example, always taking the left box, right? Because let’s say you switch back and forth between the two, then in retrospect half the time the left box had the dollar. So you would’ve done better had you just always taken the left box.

Daniel Filan: Yeah. For “if you would’ve done better”… somehow this is assuming that I could have just taken the left box, but the problem would’ve been the same as it actually was.

Caspar Oesterheld: Exactly. Part of how this criterion works is that it assumes that the problem says… the specific instance of this multi-armed bandit problem (as it’s called) that you face consists in specifying at each time step, how much money or how much reward is in each of the boxes, and that this is in some sense independent of what you do. In particular the adversarial multi-armed bandit literature very explicitly allows the case where the way the boxes are filled happens to be exactly running whatever algorithm you run. So how is this solved? So the way this is solved is that the learner has to use randomization. So you have to randomize over which boxes you pick or which expert’s advice you follow.

Daniel Filan: This is how it’s solved in the adversarial-

Caspar Oesterheld: Yeah, in the adversarial setting. There are also non-adversarial settings where basically, I don’t know, you know from the start that each of the boxes on every day follows the same distribution, like some Gaussian with some mean. The whole task consists in doing optimal exploration to find out which box has the highest mean value. And there you don’t need randomization. I think in practice probably people still do randomization in these cases, but you definitely don’t need it.

Whereas in this adversarial case, to achieve sublinear regret, you have to randomize and you have to also assume that the environment cannot predict the outcomes of your random coins. It’s like, I don’t know, if you flip a coin to decide whether to take the left or the right box, then you assume that the environment, it can also flip a coin, but it can only flip a different coin.

So I find that a bit dissatisfying philosophically, it seems a bit weird to assume this, but that maybe I can live with. I think the part that I really kind of don’t like is that in these problems where you randomize or where you need to randomize, regret minimization requires that you randomize, but the restrictions that it imposes are all on your actions rather than on the distributions over actions that you pick.

Daniel Filan: How do you mean?

Caspar Oesterheld: Yeah, so here’s an example, which is it’s basically Newcomb’s problem. So imagine that there are two boxes. The way the environment fills the boxes works as follows. It runs your algorithm to compute your probability of choosing the left box. And the higher this probability is of you choosing the left box, the more money it just puts in both boxes, but then also it puts an extra dollar into the right box. And we make it so that your probability of choosing the left box, it really, really strongly increases the reward of both boxes.

So you get, I don’t know, for each 1% of probability mass that you have on the left box, you get a million dollars or something like that put in both boxes. The numbers don’t need to be that extreme, I just don’t want to think about the exact numbers or how large they need to be. So in this problem, if this is how the problem works, then if you want to maximize your utility and you maximize over your probability distribution over boxes, then I think the reasonable thing to do is to always pick the left box. Because basically for each probability mass that you put on the left box, you gain lots of money. Whereas for each, you only gain $1 by moving probability mass from the left box to the right box.

Daniel Filan: Yep. And you lose all the money you could have made by having probability mass on the left box and both boxes get money.

Caspar Oesterheld: Yeah. So for example, if you always choose the right box, you only get a dollar. And if you always choose the left box, you get a hundred million dollars. Okay, so this is an example of a setting, and if you optimize over probability distributions, then you should choose the left box. This is not what regret minimization says you should do. Regret minimization would here imply that you have to always choose the right box, well, in the limit. You have to learn to always choose the right box. And the reason for that is that if you choose the left box, you’ll have regret, right? You’ll regret that you didn’t choose the right box because then you would’ve gotten a dollar more.

And the reason why… that’s kind of what I said somewhat cryptically, in some sense the reason for why this issue occurs is that regret minimization is, it’s a criterion on which actions you take and it doesn’t care about how you randomize. It kind of doesn’t say anything about how you should randomize other than saying, “Well, you should randomize over the things that give you high reward holding fixed how you randomize,” or I don’t know, “We don’t care about how you randomize.” It says something about these actions and it doesn’t say how you should randomize. And I think this is just not so compelling in this particular problem. I think, in this particular problem, one should just always choose the left box.

Daniel Filan: And is this also true in the setting of adversarial bandits?

Caspar Oesterheld: Yes. Basically adversarial bandits allow specifically for this kind of problem. To get high reward on these, one has to do this randomization stuff where sometimes one has to learn to make the probability distribution so that one’s reward is low, just to make one’s regret also low.

Daniel Filan: All right. So okay, that’s what’s wrong with other settings. So what’s your theory of bounded inductive rationality?

Why it matters

Caspar Oesterheld: Can I say something about why I think this is even important?

Daniel Filan: Oh, yeah: why is it important?

Caspar Oesterheld: Okay, so this particular setting, this “left box, right box” example is farfetched, of course, but the reason why I think it’s important is that it’s this kind of setting where the environment tries to predict what you do and then respond to it in a particular way. To some extent, that’s kind of the core of game theory. And so if you want to use these regret minimizers specifically in a game theoretic context, I think it’s kind of weird to use them given that it’s very easy to come up with these game theory-flavored type of cases where they clearly give weird results. So in particular, I think sometimes people justify Nash equilibrium by appealing to these regret minimizers, so you can show that if you have two regret minimizers and they play a game against each other, they converge to… I think it’s not exactly Nash equilibrium, it’s some form of correlated equilibrium.

But yeah, some equilibrium concept. (That is if they converge. Sometimes they also don’t converge.) But if they converge, they converge to some kind of Nash-like solution concept. And so you might say, “Well, it’s great. This really shows why Nash equilibrium is a good thing and why rational agents should play Nash equilibrium against each other.” In some sense, I think this Nashian idea is already kind of baked in in this regret minimization concept in a way that seems not so compelling, as this “left box right box” example shows. Yeah, so that’s why my theory… That’s why I’ve worked on this and tried to come up with a theory that reasons about randomization and things like that in a very different way.

Daniel Filan: Okay. Can I try to paraphrase that? So we want a theory of how to be a rational agent and there are a few criteria that we want. Firstly, we want to be able to go from something like how do you make decisions normally, we want to be able to use that in the context of game theory as some sort of foundation. And secondly, we want to have it be bounded. So that’s why we’re sort of thinking of the regret minimization, some number of experts frame where we just want to consider all the ways of doing things we can fit in our head and not necessarily worry about everything. So we want to have a decision theory foundation for game theory and we want to be bounded. And because we want a decision theory foundation for game theory specifically, we want our decision theory method… We want our way of making decisions to allow for environments that are modeling us.

And so basically your complaint is: well, normal regret minimization, it does the bounded thing, but it doesn’t do a very good job of thinking about the environment modeling you. There are various ways of thinking about game theory where you can think about the environment modeling you, but you’re not necessarily bounded. Probably my favorite paper in this line is the ‘reflective oracles’ line of work where– sadly, I can’t explain it right now, but if you assume you have something that’s a little bit less powerful than a halting oracle, but still more powerful than any computer that exists, then agents that just model their environment and make decisions, they end up playing Nash equilibria against each other. It’s a really cool line of research. I encourage people to read it, but no existing thing could possibly implement the thing that those papers are talking about. So you want to get all of these three criteria: boundedness, decision theory to game theory, and environments that are modeling the decision maker. Is that right?

Caspar Oesterheld: Yes. That’s a very good summary. Thanks.

How the theory works

Daniel Filan: All right, I now feel like I understand the point of this paper and how it’s related to your research agenda much better. Okay, so we’ve talked about what you want to be doing. How’s it work? What do you have?

Caspar Oesterheld: Okay, so we have the same kind of setting as before. On each day, we choose between a set of options. And slightly different, we don’t require that these counterfactuals, which one talks about a lot in regret minimization, we don’t require that these are well-defined. And we also have these experts… I guess we call them hypotheses rather than experts, but they basically do the same thing except that in addition to making a recommendation at each time step, they also give estimates of the utility that they expect to get if the recommendation is implemented.

So if you, again, have this picture of: I’m trying to play chess and there are 10 people in the room, then one of them might say, “Okay, if you play Knight F3 then I think there’s a 60% chance that you’re going to win or you’re going to get 0.6 points in expectation,” or something like that. So they’re a bit more complicated than the experts, but only slightly; they give you this one additional estimate. And now, similar to regret minimization, we define a notion of rationality relative to this set of hypotheses. I think I want to describe it in terms of the algorithm rather than the criterion for now, because I think the algorithm is actually slightly more intuitive.

Daniel Filan: So this is the opposite way than how you do it in the paper (for people who might read the paper).

Caspar Oesterheld: Yeah, the paper just gives the criterion and then the algorithm is just kind of hidden in the appendix. I mean, it’s like some brief text in the main paper saying that, “Yeah, this is roughly how it works.” In some sense, of course, the general abstract criterion is much more… I think it’s more important than the algorithm itself. I think it’s a bit similar to the logical induction paper where they define similarly a kind of notion of rationality or having good beliefs or something like that. And yeah, they have this criterion, but they also have a specific algorithm, which is a bit like running a prediction market on logical claims. And I think… In practice, I think many more people have this idea in their minds of “just run a prediction market between algorithmic traders” than this specific criterion that they define. Even though I think from a theoretical perspective, the criterion is really the important thing and the algorithm is just some very specific construction.

Daniel Filan: Yeah. I actually want to talk more about the relationship to the logical inductors paper later. So what’s your algorithm?

Caspar Oesterheld: So basically the algorithm is to run an auction between these hypotheses. So first we need to give the hypotheses enough money to deal with. So they now have money, like some kind of virtual currency. We need to be a bit careful about how exactly we give them money initially. It’s especially tricky if we have infinitely many hypotheses, but basically we need to make sure that we eventually give all hypotheses enough money so that we can explore them, which has to be infinite money. But we also, if we give everyone $10 in the beginning, then this auction will just be chaos because there will be lots of crazy hypotheses that bid nonsense. So maybe it’s easiest to first consider the case of finitely many hypotheses, and let’s just say that in the beginning we gave each hypothesis 100 virtual dollars. And now we run auctions and the way do this specifically is that we just ask all the hypotheses for their recommendation and their estimate of what reward we can get if we follow the recommendation.

Then we follow the highest bidder, so we take the highest bidder. But the hypotheses, in how much they can bid, they’re constrained by how much money they have. So if a hypothesis doesn’t have any money, it can’t bid high in this auction. So we take the highest bid - the highest budgeted bid, I guess - then we take that hypothesis. The hypothesis has to “pay us” their bid. So it’s like a first price auction. They pay us their bid and then we do whatever they told us we should do. Then we observe, we get our reward, and then we pay them that reward or a virtual currency amount proportional to that reward.

Daniel Filan: Okay. So is the idea that if a hypothesis slightly low-balls, but it’s basically accurate about how much reward you can get, other hypotheses that overestimate how well you can do are going to blow all their money, you’re going to save it up because you get some money every round. And then eventually you could be the top bidder and you’ll make money by… you slightly low ball, and so you get back a little bit more than you spent to make the bid and you just eventually dominate the bids. Is that roughly how I should think of it working?

Caspar Oesterheld: Yes, that’s basically the thing to imagine. I think depending on… Yeah, with this low-balling, that depends a bit on the scenario. For example, if you have a setting where the payoffs are deterministic and fully predictable, so you just choose between a reward of 5 and a reward of 10 and there are hypotheses that can just figure out that these are the payoffs, then if you have enough hypotheses in your class, then there will be just one hypothesis that just bids 10 and says you should take the 10 and then you won’t… The winning hypothesis won’t be low-balling, it will just barely survive. It’s like the typical market argument where if you have enough competition then the profit margins go away.

Daniel Filan: Okay. And the idea is that other agents which overpromise are going to… If agents overpromise, then they lose money relative to this thing that bids accurately. And if agents underpromise, then they don’t win the auction. And so this thing never spends money so it just survives and wins.

Caspar Oesterheld: Yeah, basically that’s the idea. I guess the most important kind of features, the first thing to understand is really that the hypotheses that just claim high rewards and don’t hold up these promises, they just run out of money and so they don’t control much what you do. And so what’s left over once these are gone is you basically follow the highest bid among those bidders that do hold up their promises. And so in the limit, in some sense, among these you actually do the best thing.

Relationship to logical induction

Daniel Filan: Cool. And so basically in the rest of the paper, if I recall correctly, it seems like what you do is you basically say… well, you make a bunch of claims, which net out to, if one of these traders in this auction can figure out some pattern, then you can do at least as well as that trader. So if you’re betting on these pseudorandom numbers, then you should be able to do at least as well as just guessing the expectation. And if none of… If the pseudorandomness is too hard for any of the agents to crack, then you don’t do any better than that. At the end, I think there’s a connection to game theory, which we can talk about a bit later, but we touched a bit on this relationship to logical inductors. When I was reading this paper, I was thinking, the second author, Abram Demski, I think he’s an author on this logical inductors paper from 2016 or so.

Caspar Oesterheld: I think he’s actually not. I’m entirely sure, but I think he might not be.

Daniel Filan: Oh, he’s not? Okay. He is a member of the organization that put out that paper at least.

Caspar Oesterheld: He is very deep into all of this logical induction stuff. I’m not sure I could have written this paper without him because knows these things very well and I think that was quite important for this particular project.

Daniel Filan: Yeah. You’ve got this coauthor who’s connected to the logical induction world. They’re both about inductive rationality by bounded beings, and they both involve these algorithms where agents bid against each other. How is this different from the logical induction paper?

Caspar Oesterheld: I guess the main and most obvious difference is that the logical induction paper is just about forming beliefs in some sense. It’s just about assigning probabilities to statements. You can think about how you can use that then to make decisions, but very basically, if you just look at the logical induction paper, it’s all about forming beliefs about whether a particular claim is true. Whereas regret minimization, but also this rational inductive agency project, it’s all about making decisions between some set of options. I think that’s a fundamentally different setting in important ways. I think the most important way in which it’s different is that you have to deal with this counterfactual issue, that you take one of the actions and you don’t observe what would’ve happened otherwise. For example, one way in which you can clearly see this is that in any of these decision settings, you have to pay some exploration cost.

With the bounded rational inductive agents, sometimes you will have to follow a bidder, a hypothesis that has done terribly in the past. You need to sometimes follow it still because there’s some chance that it just did poorly by bad luck. In fact, we didn’t go that much into the details of this, but you actually have to hand out money for free to your hypotheses so that you can exploit each hypothesis infinitely often because otherwise, there’s some chance that there’s some hypothesis that really has the secret to the universe, but just on the first hundred time steps for some reason it doesn’t do so well. You have to pay this cost, and regret minimizers also pay this cost. With logical induction, in some sense you do exploration, but one thing is that if you… maybe this will be hard to follow for readers who aren’t familiar with that. Sorry, listeners.

Daniel Filan: I’m now realizing we should probably just say a few sentences about what that paper was. According to me, the logical inductors paper was about how do you assign probabilities to logical statements. The statements are definitely either true or false. They’re statements like “the billionth and 31st digit of pi is 7”, and you’re like, what’s the chance that that’s true? Initially you say 1/10th until you actually learn what that digit of pi actually is by calculating it or whatever, and it uses a very similar algorithm. The core point of it is every day you have more questions that you’re trying to assign probabilities to, and you learn more logical facts and eventually you just get really good at assigning probabilities.

Caspar Oesterheld: Yeah.

Daniel Filan: Anything I missed out on?

Caspar Oesterheld: Well, I guess for what I was about to talk about it’s good to have some kind of intuition for how the actual algorithm or the proposed mechanism works. Very roughly it’s that instead of hypotheses or experts, they have traders which are also, I don’t know, some set of computationally simple things. Very roughly what happens is that these traders make bets with each other on some kind of prediction market about these different logical claims, that one is trying to assign probabilities to. Then the idea is if there’s a trader that’s better than the market at assigning probabilities, then the trader can make money by bidding against the market. Eventually that trader will become very wealthy and so it will dominate the market. That way you ensure that in some sense you do at least as well as any trader in this set.

Daniel Filan: Crucially, traders can kind of choose what markets to specialize in. If you’re betting on digits of pi or digits of e, I can be like, “well, I don’t know about this e stuff, but I’m all in on digits of pi”. I guess this can also happen in your setting if you’ve got different decision problems you face.

Caspar Oesterheld: Yeah. That is the way in which our bounded rational inductive agency theory is more similar to the logical inductors. Our hypotheses are allowed to really specialize and they only bid every 10,000 steps on some very special kind of decision problem, and otherwise they just bid zero. You still get that power in some sense. Whereas the regret minimizers don’t have this property. Generally, you really only just learn what the best expert is.

Daniel Filan: Cool. I cut you off, but sorry, we were saying about a comparison with logical inductors and it was something about the traders.

Caspar Oesterheld: Yeah. One thing is that if you have a new trader, a trader that you don’t really trust yet, then you can give them a tiny amount of money and then they get to make bets against the market. If you give them a tiny amount of money, their bets won’t affect the market probabilities very much. You can explore this in some sense for free. In some sense, they don’t influence very much what you think overall because the idea is even if you give them a tiny amount of money, if they’re actually good, they’ll be able to outperform the market and they’ll be able to get as much money as they want. That isn’t possible in the decision-making context because to explore a hypothesis in the decision-making context, you have to make this “yes/no” decision of actually doing what they do. If you have a hypothesis that’s just a complete disaster, you’re going to make this disastrous decision every once in a while.

Daniel Filan: Yeah. In some sense it’s because decisions just have discrete options. If you’re a predictor, you can gradually tweak your probability a tiny bit, but you can’t quite do that in…

Caspar Oesterheld: Yeah. Though, I think the fundamental issue has more to do with these counterfactuals and then the counterfactuals not being observed. To test a hypothesis, you have to do something differently. Let’s forget about the logical inductors for a second and just say, I don’t know, you make some predictions about all kinds of things and I am unsure whether to trust you or not. Then I can just ignore what you say for the purpose of decision-making or for the purpose of assigning beliefs myself or stating beliefs to others. I can completely ignore this. I don’t have to do anything about it, and I can just track whether you are right. If you’re good, I’ll still eventually learn that you’re good. Whereas if you tell me, I don’t know, you should really do more exercise or something like that and I ignore it, I’ll never learn whether you were actually right about it or not.

Daniel Filan: Okay. Yeah, it’s giving me a better sense of the differences these theories have. Cool. Another difference, which was kind of surprising to me, is that in the logical inductor setting, I seem to recall hearing that if you actually just do the algorithm they proposed in the paper, it’s something like 2 to the 2 to the X time to actually figure out what they even do. Whereas with your paper, if all of the bidders about what you should do, if they’re computable in quadratic time, if it’s just all the quadratic time algorithms, it seemed like you can have your whole routine run in quadratic time times log of log N or some really slow growing function, which strikes me as crazy fast. What’s going on there? For one, what’s the difference between the logical inductor setting, and two, how is that even possible? That just seems like so good.

Caspar Oesterheld: Yeah. That is a kind of notable difference, in some sense. This algorithm that I just described, this decision auction, as we sometimes call it, it’s just an extremely low overhead algorithm; you can just think about it. You run all your bidders, then you have a bunch of numbers. You have to take the max of these numbers. That’s all very simple. The reason I think why the logical induction algorithm that they propose is relatively slow is that they have to do this fixed point finding. I don’t know.

I think roughly the way it actually works is that the traders, it’s not really like a prediction market in the literal sense. I think the traders actually give you functions from market prices to how much they would buy or something like that. Then you have to compute market prices that are a fixed point of this, or an approximate fixed point of this. This fixed point finding is hard. I think maybe here the continuity is a big issue; that probabilities are continuous, so in some sense you need to find something in a continuous space.

The correct probability in some sense is some number between 0 and 1, or actually I think it’s the probability distribution over all of these logical statements or the probabilities of all of these logical statements. It’s this really large object or an object that in some sense carries a lot of information. It’s from some large set, so they need to find this in this large set. Whereas we only have this one decision. But I think that on some level, the criteria are just pretty different. They’re similar in many ways in that they designed this market and so on, but I think on some level the criteria are just somewhat different between the logical induction paper and ours.

Daniel Filan: Yeah. It’s strange though because you would think that making good decisions would reduce to having good beliefs. One way you can do the reduction is every day you have some logical statement and you have to guess “is it true with probability 1?”, with probability 99%, 98%, and I don’t know. You just have 100 options and you have to pick one of them.

Caspar Oesterheld: Yeah. In theory that kind of works, but if you apply a generic decision-making framework to this setting where you have to say the probability or decide which bets to accept or something like that, then you’re not going to satisfy this really strong criteria that this Garrabrant [logical] inductor satisfies. For example, if you have the bounded rational inductive agents and on each day you have the choice between what’s the highest price you would be willing to pay for a security that pays a dollar if some logical statement is true and pay zero otherwise. That’s assigning a probability to that statement, but you have to explore all of these hypotheses that say, yeah, you should buy the $1 thing. You should buy it for $1. Even for digits of pi, where let’s say you take… or coin flips, something that’s actually random and where you should just learn to say one half every day. The bounded rational inductive agents will necessarily say any answer infinitely often. They will converge to giving one of the answers with limit frequency 1, but every once in a while they’ll say it’s something completely different.

Daniel Filan: Yeah, so it’s not like actual convergence.

Caspar Oesterheld: Yeah. I think in the regret minimization literature, people sometimes say they have these different notions of convergence. It’s convergence in iterates and convergence in frequencies I think, or something like that, and you only have the weaker thing with bounded rational inductive agents.

How fast does it converge?

Daniel Filan: Yeah. In fact, this is one of the things I was wondering, because in the paper you prove these limit properties and I’m wondering, okay, am I going to get convergence rates? It seems like if you make the wrong decision infinitely often, but infinitely less frequently, in some sense it feels like the right thing to say is that you have converged and you can talk about the convergence rate, but you haven’t actually literally converged. Is there some intuition we can get on what the convergence or quasi-convergence properties of these things are over time?

Caspar Oesterheld: Yeah. One can make some assumptions about a given bounded rational inductive agent and then infer something about the rate at which it converges. Roughly, I think the important factors are the following. The first is: how long does it take you to explore the hypothesis that gives you the desired behavior? If you imagine, I don’t know, you have some computational problem deciding whether a given graph has a clique of size 3 or something like that, that can be done in cubic time if not faster. And so you might wonder… The first thing is how long does it take you to explore the hypothesis that just does the obvious computational procedure for deciding this, which is just try out all the combinations of vertices and deciding whether that’s a clique. In some sense it’s kind of similar to these bounds that you have for Occam’s razor-type or Solomonoff prior things where it’s kind of like the prior of the correct hypothesis.

Similarly here, it’s a bit different because really what matters is: when do you start giving it money? Then there’s still… if things are random, then it might be that you give it money, but then it has bad luck and then it takes a while for it to… Yeah. That’s the first thing that matters. Then the other thing that matters is: how much other exploration do you do? If you explore something very quickly, but the reason you explore it very quickly is that you explore everything very quickly and you give lots of money to lots of hypotheses that are nonsense, in some sense, it then takes you longer to converge in the sense that you’re going to spend more time doing random other stuff. Even once you’ve found the good thing, the good hypothesis that actually solves the problem and gives an honest estimate, you’ll still spend lots of time doing other stuff.

Daniel Filan: You have to have this balancing act in your payout schedule.

Caspar Oesterheld: Even just to satisfy the criterion that we described, one has to make sure that the overall payoffs per round go to zero. In some sense the overall payoffs per round is kind of how much nonsense you can do, because to do nonsense, you have to bid high and not deliver, which is kind of like losing money out of this market. That’s how you control that. Then meanwhile, you also have to ensure that each trader or each hypothesis eventually gets infinite amounts of money.

Daniel Filan: Yeah.

Caspar Oesterheld: These are the two things. Within satisfying these constraints you can balance them in different ways.

Daniel Filan: Yeah, I guess 1 over T is the classic way to do this sort of thing.

Caspar Oesterheld: Yeah, that’s the one that we have in the proof.

Non-myopic bounded rational inductive agents?

Daniel Filan: Yeah. That’s actually how I… it just flashed in front of eyes and I was like, ah, now I see why they picked that. Yeah, I just skimmed that appendix. Yeah. One thing I want to ask about is: you talk about this bandit setting, and in particular you’re being myopic, right? You see a thing and you’re supposed to react myopically to the thing. And there are other settings, other relatively natural settings like Markov decision processes or something. I’m wondering: how do you think the work could be extended to those sorts of settings?

Caspar Oesterheld: Yeah. That’s a good question. I think there are different ways. I guess one way is that if it’s, for example, an episodic Markov decision process or I don’t know, some other thing…

Daniel Filan: Oh, yeah. That was my fault. Yeah. A Markov decision process is like, you’re in a state of the world, you can take an action and the world just works such that whenever you’re in a state and you take a certain action, there’s some other state that you go to with some fixed probability no matter what the time is. Similarly, you get some reward similarly with some sort of fixed probability. Anyway, I interrupted you and I forgot what you said, so maybe you can start again. I’m sorry.

Caspar Oesterheld: Right. There are different answers to how one would apply bounded rational inductive agents or extend them to the setting. I guess the most boring thing is: well, you can just apply them to find a whole policy for the whole Markov decision process, or sometimes there are these episodic Markov decision processes, which basically means you act for 10 time steps, then you get a reward, and then you kind of start over and then you can treat the episodes as separate decision problems that you can solve myopically. Okay. That’s a relatively boring answer. The more interesting thing is that you could have something like a single Markov decision process and all you do is you play the single Markov decision process and it never starts over, and you want to maximize discounted reward, which is, if you get a reward of 1 today, it’s worth 1 to you. If you get a reward of 1 tomorrow, you get 0.9, if you get it the day after, 0.81 and so on. That would be a discount factor of, well, 0.9 or 0.1.

In this case, I think what one could try to do, and I haven’t analyzed this in enormous detail, but I think it’s a very natural thing that I think probably works, is that one applies this whole auction setup, but the reward that one is supposed to estimate at each step, at each step that one is deciding which action to take, the hypotheses are supposed to give an estimate of the discounted reward that they’re going to receive. Then if they win, they get the discounted reward, which means that it has to be paid out slowly over time. At time step 1000, you have to give the winner of the auction at time step 100 a tiny bit of reward. I think then probably things generally hold, with some issues that complicate things.

One issue is that hypotheses now have to take into account that on future steps exploration might occur. Let’s say I’m a hypothesis and I have some amazing plan for what to do, and the plan is to first play this action, then play that action, and then this another action and so on. I have this detailed plan for the next 100 time steps, but I have to worry that in 50 time steps there’ll be some really stupid hypotheses coming along, bidding some high number and doing some nonsense. I have to take this into account, so I have to bid less. This makes everything much more complicated. If there is a hypothesis that has a really good plan, it’s harder for that hypothesis to actually make use of this because it can’t rely on winning the auction at all these steps, so there are definitely some complications.

Relationship to game theory

Daniel Filan: Yeah. Interesting. The final thing I want to talk about in this paper is: at the end you talk about using it as some sort of foundation for game theory where if you have these BRIAs, I’m going to call them (for boundedly rational inductive agents)… I say I’m going to call it, that’s what they’re called the paper, I’m not an amazing inventor of acronyms. You talk about these BRIAs playing games with each other and they each think of it as one of these bandit problems. Essentially you say that if they have these rich enough hypothesis classes, they eventually play Nash equilibrium with each other, is my recollection.

Caspar Oesterheld: Yeah. There are different versions of the paper that actually give different results under different assumptions. There is a result that under some assumptions they give Nash equilibrium. There’s also a result that is more like a folk theorem, which kind of says that you can, for example, converge to cooperating in the prisoner’s dilemma. Maybe I could try to give a general sense of why it’s kind of complicated what happens, and why maybe sometimes it’s going to be Nash and sometimes it’s going to be something else. Okay. The first thing is that these bounded rational inductive agents, nothing hinges on randomization. In some sense that’s kind of the appealing part relative to regret minimizers. There’s no randomization, no talk about counter-factuals. You just deterministically do stuff. In particular, if you have a bounded rational inductive agent play against the copy of itself in a prisoner’s dilemma, it will converge to cooperating because whenever it cooperates, it gets a high reward, whenever it defects it gets a low reward.

So there are bidders that get high reward by recommending cooperation and bidding the value of mutual cooperation, maybe slightly below. The defect bidders, they might hope like, “okay, maybe I can get the defect/cooperate payoff”, but they can’t actually do this because if the bidder in one market achieves this - the hypothesis in one market - if it tries to do this, then its copy in the other market also does it. So whenever you actually win, you win the auction, you just get the defect/defect payoff. This would converge to cooperation. The reason why there’s a Nash equilibrium-style result nonetheless is that: if the different agents are not very similar to each other, then you might imagine that the hypotheses can try to defect in some way that’s de-correlated from the other market, or from what happens in the other auction.

The simplest setting is one where they can actually just randomize, but you could also imagine that they look at the wall and depending on the value of some pixel in the upper right or something like that, they decide whether to recommend defecting or not. If they can do this in this a decorrelated way, then they break the cooperate/cooperate equilibrium. The reason why there are different results and also why there are different versions with different results is that… I’m still unsure what the correct conclusion from it is or what the actual conclusion is. For example, I’m still unsure whether these cooperate/cooperate outcomes in the prisoner’s dilemma, whether you can achieve them naturally without fine-tuning the markets too much to be exact copies.

Daniel Filan: Yeah. I guess one way to think about it is, you’ve got this weak uncorrelation criterion where if you meet it, you fall back to Nash and another criteria and you get this. Presumably the right theorem to show is, under these circumstances you fall into this bucket and you get this, under these [other] circumstances, you fall into this bucket and you get this. If I implement my bounded rational inductive agent slightly differently, like I order the hypotheses a bit differently, do we know whether that’s going to hit the corporate/corporate equilibrium or it’s going to be weakly uncorrelated?

Caspar Oesterheld: I think even that is already not so easy to tell. I don’t know. I’ve done some experiments and generally it learns to defect in this kind of setting, but in these experiments the hypotheses are also relatively simple, so you don’t have hypotheses that try to correlate themselves across agents. For cooperating, one hope would be that you have a hypothesis in one of the markets and a hypothesis in the other market and they somehow try to coordinate to cooperate on the same rounds. This becomes quite complicated pretty quickly. I think maybe another important thing here is that there’s still a difference between the general bounded rational inductive agency criterion versus this specific auction construction. The criterion allows all kinds of additional mechanisms that you could set up to make it more likely to find the cooperate/cooperate equilibrium. You could specifically set it up so that hypotheses can request that they only be tested at various times in such a way that it’s correlated between the markets or something like that. It’s very complicated and that is still something I’m working on. Hopefully other people will think about this kind of question too.

Safe Pareto Improvements

What they try to solve

Daniel Filan: Cool stuff. Yeah. There’s actually a bunch more… there’s at least one more thing, a few more things that I would love to talk about with this bounded rational inductive agents paper, but we spent a while on that and there are two other papers that I’d like to talk about as well. Let’s move on to the next paper. This is called Safe Pareto Improvements by yourself and Vincent Conitzer. Can you just give us a sense of what’s this paper trying to do?

Caspar Oesterheld: This paper is trying to directly tackle this equilibrium selection problem: the problem that if you play a given game, it’s fundamentally ambiguous what each player should do, so you might imagine that they fail to coordinate on a good Nash equilibrium, on a Nash equilibrium at all, and they might end up in these bad outcomes. A very typical example of this is a setting where players can make demands for resources, and they can both demand some resource, and if they make conflicting demands, the demands can’t both be met, so usually something bad happens; they go to war with each other in the extreme case. Safe Pareto Improvements is a technique or an idea for how one might improve such situations, how one might improve outcomes in the face of these equilibrium selection problems.

Maybe I can illustrate this with an example. Let’s take a blackmail game. This is actually a bit different from the kind of example we give in the paper, but this is a version of this idea that people may have heard about under the name surrogate goal or surrogate goals. Let’s say that I’m going to delegate my choices to some AI agent. You can think of something like GPT-4, and what I’m going to do is I’m going to tell it, “here’s my money, here’s my bank account, my other web accounts, and you can manage these, so maybe you should do some investing”, or something like that. And now, this AI might face strategic decisions against different opponents. Right? So other people might interact with it in various ways; [they might] try to make deals with it or something like that, and it has to make decisions in the face of that.

And let’s just take some very concrete interaction. So let’s say that someone is considering whether to threaten to report my online banking account or something like that if my AI doesn’t give them $20. So they can choose to make this kind of threat. Now, maybe, you might think that it’s just bad to make this kind of threat and I should just not give in to this kind of threat. But you can imagine that there’s some kind of moral ambiguity here, so you could imagine that this person actually has a reasonable case that they can make for why I owe them $20. Maybe at some point in the past I promised them $20 and then I didn’t really give them $20, or something like that. So they have some reason to make this demand.

So now, this is a kind of equilibrium selection problem where my AI system has to decide whether to give in to this kind of threat or not. And this other person has to decide whether to make the threat, whether to insist on getting the $20 or not. And there are multiple equilibria. So the pure equilibria are the ones where my AI doesn’t give in to these kind of threats and the other person doesn’t make the threat. And so that’s one. And the other one is, my AI does give in to the threat and the other person makes this kind of threat. Okay. So a typical equilibrium selection problem, in some ways. There’s some ways in which this is a game theoretically a bit weird, this is a non-generic game, and so on. So I think this kind of example is often a bit weird to game theorists. So maybe, for game theorists, the example in the paper works a bit better. But I like this kind of example: I think it’s more intuitive to people who aren’t that deep into the game theory stuff.

Okay, so now, I’m deploying this AI system, and I’m worried about this particular strategic interaction. And in particular, maybe the case that seems worst is the case where the coordination on what the correct equilibrium is fails. So in particular, the case where my AI decides not to give in to the threat and the other person thinks like, “No, no, I should really get that $20, so I’m going to make this threat,” and then they report my online banking account and they don’t even get the $20, and so everyone’s worse off than if we hadn’t interacted at all. Basically, utility is being burned. They are spending time reporting me on this online banking platform, I have to deal with not having this bank account anymore, or have to call them, or things like that.

Alternative solutions

Caspar Oesterheld: Okay. Now, there are various ways in which you might address this. And I kind of want to first talk a bit about some other ways you might deal with this, if that’s okay, to give us a sense of what’s special about the solution that we propose. Because I think it’s kind of easier to appreciate what the point of it is if one first sees how more obvious ideas might fail.

So the first thing is that if one of us is able to credibly commit, at some point, to some course of action, they might want to do so. So I might think that the way for me to do well in this game is just that I should be the first to really make it credible that my AI system is never going to give in to this kind of threat. And I should just announce this as soon as possible, I should try to prove this, that I’m not going to give in to threats. And then maybe the other person, they would want to try to, as fast as possible, commit to ignore this kind of commitment and so on.

So this is… I think it’s a reasonable thing to think about, but it kind of feels like it’s not really going anywhere. Ultimately, to some extent, people are making these commitments simultaneously, they might also just ignore commitments, right? It seems like if you go around the world and whenever someone makes some commitment to, I don’t know, threaten you or to ignore anything you do to get them to do something, you shouldn’t be the kind of person to just give in and cave to any such commitment. You kind of have to have some kind of resistance against these kind of schemes. So I think in practice, this just doesn’t resolve the problem.

And also, it’s a very zero-sum way of approaching solving this problem. It’s all about, I am just going to try to win. Right? I’m just going to try to win by keeping my $20 and deterring you from even threatening me. And you might say, “Well, I’m going to deter that. I really want the $20, and I’m going to be first.” Right? So…

Daniel Filan: It’s also very… I don’t know, if you imagine implementing this slightly more realistically… people learn things over time, they understand more facts. And if you’re racing to make commitments, you’re like, “Oh yeah, I’m going to determine what I’m going to do when I know as little as possible.” It’s not an amazing…

Caspar Oesterheld: Yeah. Daniel Kokotajlo has this post on, I think, LessWrong or the Alignment Forum titled Commitment Races or something like that. It’s also kind of about this idea that one wants to commit when one knows as little as possible, and that seems kind of problematic.

Daniel Filan: Yeah.

Caspar Oesterheld: Okay. So that’s one solution. Another solution might be that I could try to just pay you, offer you $5 or something like that in return for you not blackmailing me. So we make some outside deal, and the idea is that if we do make this deal, then okay, I still have to pay some amount of money, but at least this inefficiency of utility really being burned, it disappears if this deal is made. But then on the level of figuring out what deal to make, we get all the same problems again. I might offer $5 and the other person might say, “Well, actually, you really really owe me those $20, so obviously I’m not going to take this $5, I want at least $18,” something like that. And so you get the same problem again.

Daniel Filan: Yeah, it’s funny… Yeah, I find this a weird thing about game theory where intuitively, talking out problems seems like an easier way to solve stuff, but whenever you try to model bargaining, I don’t know, it’s kind of horrible. I’ve never seen any convincing analysis of it.

Caspar Oesterheld: Yeah.

Daniel Filan: Just very strange.

Caspar Oesterheld: And I do think that this problem, it’s very fundamental - this equilibrium selection, what’s the appropriate way of distributing resources? There are lots of approaches and they’re all good, but I do think that fundamentally, this is just a problem that is hard to get rid of entirely.

How safe Pareto improvements work

Caspar Oesterheld: But now, okay, now comes the ‘safe Pareto improvements’ or ‘surrogate goals’ idea for making progress on this. Remember, I’m deploying my AI system. I could do the following. So let’s say that by default, the way this delegation works is that I’m going to tell my AI everything that I want it to do. I’m telling it like, “Okay, here’s my money and I’m such and such risk averse. I really want to make sure that I always have at least such and such amount in my bank account in case something happens.” And also, maybe I can tell it that there’s some stamp that I really like and if this stamp appears on eBay for a price of less than $30, it should try to get it, and these kind of things.

So normally, I just honestly tell it my preferences and then I say “do your best”. Now, what I could do instead, for the purpose of this particular interaction, is the following. I first set up a dummy bank account that I don’t care about at all. So I set up some new online banking account similar to the online banking account that the other person might threaten to report. And I don’t care about this at all, but I tell my AI system to care about this bank account as much as I care about the actual online banking account. So I tell it, “Okay, if this were to be reported, that would be just as bad as if the other one is being reported.” And I have to do that in a way that’s credible, so that’s important here. The other person needs to see that I’m doing this.

So let’s say that I do this. And let’s say that, in addition, I tell my AI to not give in to threats against my actual original banking account. Now, why is this an appealing idea? Basically, the idea is that from the perspective of the person who’s thinking about threatening to report my banking account, nothing really has changed. Right? They can still threaten me, and they can still be equally successful at threatening me, because they have to threaten to report this different account now. But to my AI, that’s just the same as it would’ve been by default. It’s like to them, nothing’s really different. They don’t feel like I’m tricking them or anything. They’re completely fine with this happening.

But meanwhile, for me, there’s some chance that things improve relative to the default. In particular, they might still make this threat again now, to report this new dummy account. And it might be that my AI just gives in to that threat, right? In which case I think, “okay, this is kind of funny, but okay, that was part of the plan”. But it could also be that my AI resists, decides not to give in to this new kind of threat. Probably that’s just as likely as it would’ve been to not give into the original kind of threat. And in this case, if this happens, if the other person threatens and my AI doesn’t give into the threat, then I am better off than I would’ve been by default, because now, they’re going to report this dummy account that I don’t actually care about. So I’m just fine. It’s just as if no threat had been made. Of course, for my AI, it might still be very sad. So my AI might still think, “Oh, I’ve done a terrible job. I was instructed to protect this dummy account and now it’s been reported. I’m a bad Bing,” or something. But to me, it’s better than it would’ve been by default. And again, to the other person, it’s kind of just the same.

And that’s kind of what safe Pareto improvements mean in general. It’s this idea of making some kind of commitment or some modification of the utility functions, or some kind of way of transforming a game that ensures that everyone is at least as well off as they would’ve been if the game had been played in the default way. But under some potential outcomes, there’s some Pareto improvements. So some person is better off, or everyone’s better off, without making anyone worse off. One important part is that it’s agnostic about how this equilibrium selection problem is resolved. To make this commitment, or to tell my AI to do this, I don’t need to think about how the equilibrium selection problem in the underlying game is going to be resolved. I can make this safely without having to make any guesses about this. And the other player, similarly, they don’t need to rely on any kind of guess about how it’s going to be resolved.

Daniel Filan: Gotcha. So, actually, okay, I have an initial clarifying question where I think I know the answer. You call it a safe Pareto improvement. What’s an unsafe Pareto improvement?

Caspar Oesterheld: Yeah, good question. So the safety part is supposed to be this aspect that it doesn’t rely on guesses as to how the equilibrium selection stuff is going to work out. So an unsafe Pareto improvement might be something like, I’m transferring you $10 so that we don’t play this game, or something like that. Which is unsafe in the sense that it’s hard to tell how we would’ve played the game, and I actually don’t know whether it’s an improvement, or we don’t know whether it’s a Pareto improvement to do this deal. It relies on specific estimates about how we would’ve played this game. So yeah, that’s what the term ‘safe’ is supposed to mean. Maybe it’s not the optimal term, but that’s what it’s supposed to mean.

Daniel Filan: Gotcha. So one thing it seems like you care about is, these instructions I’m supposed to give to my AI, they’re not just supposed to make my life safely better off, they’re also supposed to make the other guy’s life safely better off. Why do I care about that? Shouldn’t it all be about me?

Caspar Oesterheld: So, yeah, that’s a good question. So that’s kind of coming from this intuition that these kind of races to commit first can’t be won. So that if I tried to come up with some scheme that commits my AI in such a way that it screws you over, then maybe you should have already committed to punishing me if I implement that scheme. Or you have some reason to try to commit as fast as possible to punish me if I try to commit that scheme, or to ignore if I try to commit this scheme. So there are all these issues, and the idea is to avoid all of this by having something that’s fine for both players so that no one minds this being implemented, everyone’s happy for this to be implemented. And so all of these competitive dynamics that otherwise are an obstacle to implementing “I just commit first” approaches, they disappear.

Daniel Filan: Gotcha. So if I kind of think about this scheme, it seems like the suggested plan, roughly, is: I have this really smart AI that knows a bunch of more things than me. In the real world, I don’t know exactly what it’s going to do or what plans it could potentially think of. And the safe Pareto improvement literature basically instructs me to think of a way that I can deliberately misalign my AI with my preferences, right?

Caspar Oesterheld: Yeah.

Daniel Filan: It seems like this could go wrong easily, right? Especially because in the safe Pareto improvements paper, you’re assuming that the principal, the person who’s doing the delegating of the game-playing, knows the game. But in real life, that might not be true. So how applicable do you think this is in real life?

Caspar Oesterheld: Yeah. In real life, one will have to do things that are much more complicated, and I think the real life surrogate goals will be much more meta. In this scheme with this very simple AI and this blackmail setting, the game is very simple. There’s binary choices, and so on. And also, in this scheme, maybe my AI doesn’t really know what’s going on. It might not understand why I’m giving it these instructions, it might just be confused, like, “Okay, well I guess this is what I’m supposed to do.” I think the more realistic way to implement this is to give some kind of meta instruction to “adopt a surrogate goal in the way I would’ve liked you to do”, or something like that.

Daniel Filan: So somehow delegate to the machine… Not only is it finding equilibria, it’s also trying to figure out what the SPI would be, and…

Caspar Oesterheld: Yeah. There are different aspects that one can delegate. Maybe one slightly more complex setting is a setting where, very roughly, it’s clear what is going to happen. Maybe very roughly, it’s clear that I’m going to deploy an AI and you can make some kind of threat against it or try to blackmail it in some ways, but it’s not clear to me how costly it is for you to make different kinds of threats. And then basically, I would have to… Implementing the surrogate goal requires knowing these things, right? Because I have to make the new thing - the new target - I have to make it somehow equivalent to the old one. And this is the kind of thing that one probably should delegate to the AI system in the real world.

Daniel Filan: Gotcha.

Caspar Oesterheld: A more radical approach is to just give it some entirely generic instruction that just says, like, “Okay, whenever you are in any kind of strategic scenario, where I might have no idea what that scenario will be, whenever you face any kind of strategic scenario, first think about safe Pareto improvements, and potentially implement such an improvement, if it exists.”

Will players fight over which safe Pareto improvement to adopt?

Daniel Filan: So this kind of gets to a question where… I mean, one of the things that safe Pareto improvements were supposed to do is deal with multiple equilibria and maybe people picking different equilibria. It seems like there are potentially tons of conflicting - or different - safe Pareto improvements, right?

Caspar Oesterheld: Yeah.

Daniel Filan: In fact, because they have such a larger action space, I would guess that there’d be even more equilibria in the “find a safe Pareto improvement” game. So are we getting very much, if it’s just really hard to coordinate on a good SPI?

Caspar Oesterheld: Yeah. Also a very important question. So I think maybe first, it’s good to get an intuition for why these many safe Pareto improvements might exist. Because I think in the example that I gave, there actually is only one, because only one player can commit. And I think, okay, depending on what else exists in that world, there might exist only one. But yeah, let’s maybe give an example that makes clear why there might be many. So in the paper, we have this example of the demand game, which is just a game where there’s some bit of territory and two countries can try to send their military to take that bit of territory, but if they both send out their military, then there’s a military conflict over the territory. So that’s kind of the base game. And then the idea is that they could try to jointly commit to…

Okay, sorry, one step back, let’s also assume that the two countries make this decision by delegating to some commission or some expert who is thinking about what the appropriate equilibrium is. And then the idea there is that they could instruct the commission to adopt some new attitudes towards this game. So they would say, “Okay, never mind the military, let’s just send someone with a flag.” Like, “Let’s just decide whether to send someone with a flag who just puts the flag in the ground and says, ‘This is now ours.’” And then we just tell them, “Well, okay, if both of our countries send someone with a flag to that territory and put in the flag, that’s really, really bad. That’s just as bad as war.” And so this is a safe Pareto improvement in this situation. Is this setting somewhat clear? I guess you’ve looked at the paper so maybe to you it’s…

Daniel Filan: Yeah, roughly, the players can either send in the military or they can just send a guy with a flag or they can send nothing. And if there’s clashes, that’s bad. But in real life, clashes are worse if there’s military. And if just one of the players does it, then the player who sends the most stuff gets the land and they want to have the land. It seems like that’s roughly the situation.

Caspar Oesterheld: Yeah. Thanks. So the safe Pareto improvement is to not send the military, which is what one would normally do, just send the guy with the flag. And then one avoids this conflict outcome where both send the military. And here, you could imagine that in some sense, it’s kind of ambiguous what to do instead of this conflict outcome. Right?

Because currently…okay, we have this “guy with a flag” story. What exactly happens with the territory if both countries send a guy with a flag? It’s kind of just left open. I think the paper just specifies that in that case it’s split, or something like that. But it could be that if both players send someone with a flag, then just country A gets the territory. And then it’s still a safe Pareto improvement because it might still be better to have just country A get the territory than to have a war over the territory, because war is not so great.

So here, there are these many safe Pareto improvements that are characterized by what happens instead of war. Like, instead of war, does one player just get the resource, or do both players… Do they split it or does the other player get the resource? Something like that.

Okay, so now the question is: does this mean that we get the same problem one level up? Or it’s just as bad, or maybe it just doesn’t help. And I think this depends a lot on the setting. I think in some settings, safe Pareto improvements really literally do nothing. They don’t help at all, because of this. And in other settings, they still help. And roughly, the settings where it helps are ones where the bad outcome that we’re replacing with something else is just worse for both players than anything on the Pareto frontier.

For example, let’s imagine that in this demand game setting where there’s this territory that two countries are having a dispute over, war is worse for both players than it is to even just not get the territory in the first place. So in that case, even the worst safe Pareto improvement that you can get, namely the one where instead of war the other person just gets the resource, is still an improvement. It’s still a Pareto improvement, it’s still an improvement for both players. And so in particular, if you’re mostly just worried about war, and maybe it’s not that terrible for you to give up this territory, then this is… You might say, okay, when we meet to decide which safe Pareto improvements to get, you’re kind of willing to just settle for the one that’s worst for you, and it’s still a gain overall.

This other example, this surrogate goal example, is another case of this, where the worst outcome for me in that case, is the one where a threat is being carried out and my bank account is being reported. And so even if we somehow make it so that in the case where my dummy account is reported, I still give the $20, which only works if the other player also makes some kind of commitment, then this is still an improvement for me. Right? So I might still be okay with just giving up the $20 in that case. The important condition, I think, is that the bad outcome that we’re getting rid of is worse than anything on the Pareto frontier. Such that even if I get the worst thing on the Pareto frontier, it’s still good for me.

Daniel Filan: But even then, the players have to agree which thing on the Pareto frontier they go for, right?

Caspar Oesterheld: Yeah.

Daniel Filan: And yeah, you have this argument in your paper that I wasn’t totally compelled by. So my recollection was, you basically said, “Well, most of the players can just say, ‘Look, if any other player recommends a safe Pareto improvement, I’ll go for that one.’ And we just need one person to actually think of something.” But then it’s like, well, who actually submits the safe Pareto improvement? Or what if multiple people do? Or what if I say… You can imagine, I say, “If someone else submits a safe Pareto improvement, then I’ll go for it. But if they don’t, I want this one.” And you have a similar instruction but you have a different fallback SPI… it still seems quite difficult to figure out how to break that tie.

Caspar Oesterheld: I mean, I’m not sure I exactly understand. Let’s take the case where I’m happy to just implement the worst possible safe Pareto improvement, where we replace the conflict outcome with me giving you the territory. If it’s a two-player game and I have this attitude and I say, “Okay, you can have the territory, in that case.” And let’s say I’m really happy, so I state this in the beginning of the negotiations, “I’m actually happy if you just take it.” Right? Then is there any remaining problem in that case?

Daniel Filan: Well, I don’t know what… I don’t know, maybe it’s a happier problem, but suppose we both come to the table and we both say, “Hey, I’m happy with any Pareto improvement over the worst one.”

Caspar Oesterheld: Ah.

Daniel Filan: Well, we still have to figure out what we get. Right?

Caspar Oesterheld: Right.

Daniel Filan: And still, on this level, it seems like you might want to try and threaten to get your preferred SPI, otherwise… “If you don’t agree to my SPI rather than your SPI, then screw it, we’re going to war.”

Caspar Oesterheld: But okay, so if both players are kind of happy to give the other person their favorite SPI, and it’s just the issue that, I don’t know, there’s a large set of different things that both people would be okay with, or…

Daniel Filan: Yeah. And I have SPIs that I’d prefer, right? And you have SPIs that you’d prefer, and they might not be the same ones.

Caspar Oesterheld: Right. Yeah. To me, this seems like a much easier problem than equilibrium selection.

Daniel Filan: Yeah.

Caspar Oesterheld: Because this is just the case where players are, basically… it’s almost this like, “No, you go first,” “You go first,” type problem.

Daniel Filan: I mean, to me, it sounds like a bargaining problem, right?

Caspar Oesterheld: But it’s only… It doesn’t seem like a bargaining problem anymore, once it’s clear that the players have overlap. Right? It’s a bargaining problem for as long as it’s unclear which SPI to go for. And I mean, yeah, if you really want to max out, you want to get the best safe Pareto improvements for you, and you also want to get the best safe Pareto improvement for you, and we both go to the table saying, “Okay, I actually really want this,” and you say you really want this. Okay, then there’s a risk. But if my argument is that even by going for the… You can just say, “Okay, you can have whatever you want in terms of safe Pareto improvements,” and even this attitude already improves things.

Daniel Filan: Yeah. I guess… maybe I shouldn’t belabor the point, but it still seems almost identical to these bargaining scenarios where there’s this best alternative to negotiated agreement, and we can all do better, but there are a few different mutually incompatible options to do better, and we have to figure out which one we want. And I ideally would get the one that’s best for me, and you ideally would get the one that’s best for you. That seems like the situation we’re in.

Caspar Oesterheld: Okay, let’s say we have some other bargaining scenario, where the two of us start a startup together, and we’re both needed for the startup, and so we have to come to an agreement on how to split the shares for the startup. Right? So this is a very typical bargaining problem. We have to figure out how much to demand, or what fraction of the startup to demand. But now, if I come to this negotiation table and just say, “I’ll accept whatever demands you want as long as it’s better for me to do the startup than to not do the startup.”

So let’s say that as long as I get 20% of the startup, it’s still better for me in terms of how wealthy I’m going to be in five years, or whatever, to be part of the startup than to not be. And let’s say that this is common knowledge. Then I might say, “Okay, as long as I get at least 20%, I’m fine.” It seems that if for some reason, I’m happy to adopt this attitude, then I would think this bargaining problem becomes very easy. Even, you might say, theoretically, it could be that you have a similar attitude and you say, “Well, actually for me, I’m also happy with just getting my minimum of 45% or whatever.” And okay, then we have this problem where we have the remaining 35% to distribute, but…

Daniel Filan: I mean, in some sense, it’s easy, but also, agents like this will get arbitrarily… The amount you’ll improve off the base game will be arbitrarily small if you’re like this and the other player’s like, “I would like everything as much as I can.”

Caspar Oesterheld: Yeah, so in this type of bargaining setup, that’s true. So in this type of bargaining setup, this attitude of just “I’m happy to accept the minimum,” is bad. You shouldn’t adopt this attitude. You should try to get more, right? You should try to make some demands, and thus risk that conflicting demands are made. Because yeah, if you just make the minimum demands, you never make any money beyond what you could have made without the startup, right? In some sense, you don’t gain anything from this whole startup, because you only demand the absolute minimum that makes it worth it for you to do the startup. The point that I’m making is that in the case of safe Pareto improvements, even this kind of minimalist, this really dovish bargaining approach to deciding which SPI to go for is still much better, potentially, than not doing anything.

Daniel Filan: So the idea is that just all of the options are just significantly better than the BATNA, basically.

Caspar Oesterheld: Yeah, so this is specifically if the outcome that you replace is some really bad thing, like going to war with each other, or you can think of even more ghastly things if you want. Then anything, even giving up the territory, or maybe giving up everything or something like that, it might still be better than this conflict outcome.

Relationship to program equilibrium

Daniel Filan: Got you. So there’s a bunch of other things to talk about here. So one thing that I was thinking about when I was reading this paper is: it seems sort of analogous to the program equilibrium literature, right? So you write a computer program to play a game, I write a computer program to play a game, but our programs can read each other’s source code, right? And the initial papers in this literature, they considered computer programs that just checked “if our programs are literally equal, then they cooperate, otherwise they defect” or something.

And then I think in advance of this… In this literature that came out of MIRI, I believe, the Machine Intelligence Research Institute, was to think about this thing they called modal combat, where I try and prove properties about your program and you try and prove properties about my program. Then through the magic of Löb’s Theorem, it turns out that we can cooperate even if we can only search through proofs of however long.

I’m wondering… So in the SPI paper, the paper kind of explicitly envisions [that] you send a particular SPI to your decision-making committee, and it says if the other person sent the exact same thing, then implement it, otherwise, fall back on your default. And I’m wondering, can we make this modal combat step where… Sorry, it’s called… Actually, I’m not even going to explain why it’s called modal combat. People can Google that if they want. But is there some way to go a little bit more meta or abstract?

Caspar Oesterheld: So yeah, one can use these kinds of mechanisms, like the modal combats, the Löbian FairBot that the researchers at MIRI propose for establishing cooperative equilibrium in this program setting. I mean, we can use that to achieve these safe Pareto improvements.

So the original safe Pareto improvement that I described, the first one that I described, the surrogate goal idea where only I modify my AI and you don’t, right? That doesn’t really require this, but if you have this case where two players each have to give some new instructions to their AI or their committee that is deciding whether to send troops to the territory or something like that, in that case, usually the safe Pareto improvements have to be backed up by some joint commitment. So both sides, they have to commit that, I don’t know, we are going to tell our committee to send the guy with a flag rather than the troops, but we only do this conditional on the other side making an analogous commitment.

And yeah, one way to back this up is to have this kind of program equilibrium-type setup where both countries, they write some contract or something like that for their commission. And the contracts, they are computer programs that look at the other country’s contract. And depending on what that contract says, it gives different instructions for how to reason about the guy with the flag versus the troops.

And this contract, if you think of it as literally a computer program, as seems reasonable in the AI case, then yeah, you could use these Löbian ideas for implementing this joint commitment. So you could say… Yeah, I’m not sure how much to go into the details of the Löbian FairBot, where you can show… If I can prove that the other side adopts their side of the safe Pareto improvements, then I adopt my side of the safe Pareto improvements. Otherwise, I just give the default instructions. And then if both sides make this commitment, then it will result in both giving the safe Pareto improvement instructions to their committees. Is that what you had in mind, or-

Daniel Filan: Yeah, that sort of thing. I guess there’s a difficulty… I mean, you might hope that you would not have to specify exactly what the SPI has to end up being, but I guess the trouble is precisely because you’re assuming you don’t know how the committee solves for the equilibrium. Presumably, your program can’t try and prove things about what the other solver is going to go for. Because if you could do that, then you could just say, “Go for this nice outcome,” or something.

Caspar Oesterheld: So there are two obstacles, I guess. The first is that potentially you can’t predict what the other committee is going to do or how it’s going to resolve the equilibrium selection problem. But the other is also that you don’t want to know, in some sense, right? Or you don’t want to adopt a policy of first predicting what the other committee does and then doing whatever is best against that, right? Because then the other committee can just say, “Well, this is what’s going to happen. We’re just going to demand that we’re going to send the troops.” Best response to sending the troops is not to send troops or the guy with the flag.

Do safe Pareto improvements break themselves?

Daniel Filan: Got you. I guess [there’s] a bunch of things I could ask, but the final thing I wanted to ask about here was… So there’s a critique of this line of research, I think this LessWrong post by Vojta Kovarik, where one of the things mentioned is: it seems that implicit in the paper is this idea that the way the committee solves games is the same with or without the safe Pareto improvements potentially existing, and all that the safe Pareto improvements do is just change which game the equilibrium selection mechanism plays.

But you could imagine, well, if I know I’m in a world that works like committees giving instructions to people, you could imagine that this potentially does change how people make decisions, and that potentially seriously limits… You could imagine that this limits the applicability of this research, and I’m wondering, how serious a limitation do you think this is?

Caspar Oesterheld: Yeah, so I do think that this is a limitation. This is something that makes it non-applicable in some cases. I mean, there’s even just the more basic worry that you might… For example, in this first AI case that I described, you might… Never mind influencing the way my AI reasons about games, right? I might just tell it, “Okay, actually, secretly, don’t give into threats against this dummy bank account,” right? If I can secretly say this to the AI, then already there’s a problem. So we have to assume that that’s not possible. And then there’s the fuzzier problem that my AI’s bargaining strategy can’t depend in some sense on the existence of safe Pareto improvements. I think in some settings, this really is just a problem that makes this very difficult.

So here’s an example where I think it’s clear that it’s a big problem. Let’s imagine that I am delegating to AI, but it’s kind of unclear which AI I’m going to delegate to. I can decide which AI to delegate to, and there are 10 different AIs on the market that I can delegate my finances to or something like that. And now I can decide which of them to hire to take care of my finances. And if I know that safe Pareto improvements will be used, I have some reason to hire an AI that’s more hawkish, that’s less likely to give into threats, because I think that it’s more likely that threats are going to be made against the surrogate goal.

So in response, the threatener might think, “Okay, if I go along with this whole surrogate goal idea, then there’s a good chance that I’m going to be screwed over,” and so they should just basically ignore the whole surrogate goal stuff and say like, “Okay, sorry, I don’t want to do the surrogate goal business, because I don’t know what AI you would’ve rented by default, and so I can’t really judge whether I’m being screwed over here.” So it definitely can be a problem in this setting.

Meanwhile, I think maybe in other cases, it is clear what the default would be. So for example, it might be clear that I generally, in cases where safe Pareto improvements don’t apply for other reasons - for example, because no credible commitment is possible or something like that - I might always delegate my choices to a particular AI. And then in cases where I want to apply my safe Pareto improvements or my surrogate goals or whatever, it seems clear that if I just use the same AI as I use in other cases, then in some sense, my choice of which bargaining strategy is deployed is not being influenced by the existence of safe Pareto improvements.

At least one way in which this can work is that you have access to the ground truth of what people would do without safe Pareto improvements. For example, by being able to observe what people intend to do in scenarios without safe Pareto improvements.

Similarity-based Cooperation

Daniel Filan: Got you. Okay. So the last paper I’d like to chat about is this paper on similarity-based cooperation. So this is co-authored by yourself, Johannes Treutlein, Roger Grosse, Vincent Conitzer and Jakob Foerster. So can you give us a sense of what’s this paper about?

Caspar Oesterheld: Sure. I guess in some sense, you’ve already set it up very well with some of this open source game theory or program equilibrium stuff that you talked about earlier. So yeah, that’s the setting where two players, they each write some source code and then the programs get access to each other’s source code and they choose an action from a given game.

So for example, in the prisoner’s dilemma, you would submit a computer program that takes the opponent’s computer program as input and then outputs cooperate or defect, and it has been shown by this literature that this kind of setup allows for new cooperative equilibria. And the simplest one is just “cooperate if the opponent is equal to this program, otherwise, defect”.

So similarity-based cooperation… This paper considers a setting that is in some sense similar. So we again have two players that submit some kind of program or policy or something like that, and that gets some information about the opponent policy or program. The main difference is that we imagine that you only get fairly specific information about the opponent. You don’t get to see their entire source code. You only get a signal about how similar they are to you. So in most of the settings that we consider in the paper, one just gets to observe a single number that describes how similar the two policies are. And the policies or programs essentially just get as input a single number and then they output, maybe stochastically, whether to cooperate or defect or what to do in the base game.

One can show that in this setting, still one can get cooperative equilibria. In some sense, it’s not too surprising, right, because the cooperative equilibrium in this program equilibrium case that we discussed, this “if the opponent is equal to this program, then corporate, otherwise, defect”… In some sense, that is a similarity-based program: “if the similarity is 100%, then corporate, otherwise, defect”. But as we show in the paper, there are more interesting things that can happen, less rigid ways of cooperating, and you can apply this to other games and so on.

Daniel Filan: Yeah. And in particular there’s a strange theorem where… so you basically have this noisy observation of a similarity function (or a difference function, I guess is the way you frame it). But I think there’s this theorem that says if you don’t put any constraints on the difference function, then you can get, I think it was something like every outcome that’s better than mini-max payoff can be realized or something. Which can be worse than Nash equilibrium, right?

Caspar Oesterheld: Yeah, that can be worse than all Nash equilibria of the game

Daniel Filan: So I guess somehow, at least in some classes of games, it’s better if the thing you’re observing is actual similarity rather than arbitrary things.

Caspar Oesterheld: Yes. So that’s this folk theorem result which says which outcomes can occur in equilibrium. Yeah, it’s surprisingly exactly the same as in program equilibrium if you don’t constrain the diff function. The way in which these weird equilibria are obtained is very non-natural. It requires that the diff function, for example, is completely asymmetric in symmetric games and things like that. To avoid this one needs natural diff functions or natural ways of observing how similar one is, in some sense of natural. We have some results in the paper also about… In some sense, it says the opposite: that says if you’re under certain conditions, that admittedly are quite strong on the game and on the way that the similarity is observed, you don’t get this folk theorem, you get a much more kind of restricted set of equilibria.

Daniel Filan: Yeah. It seemed like you had relatively weak criteria. So that theorem roughly says under some criterion on the difference function that struck me as relatively weak, but on a symmetric game, then you get, as long as there are non-Pareto-dominated Nash equilibria, then they have equal payoffs and there must be the best payoff. So it seems like, I guess you have this restriction that it’s a symmetric game and also that the Nash equilibria are non-Pareto-dominated. Or I guess there must be… hang on. There always has to be some non-Pareto-dominated Nash equilibria, right?

Caspar Oesterheld: Yeah, with some weird analysis caveats, right? The set of equilibria might be some open set, something something. But it’s probably not so reasonable to get into the details of that. But yeah, I think it’s reasonable to assume that there always is a Nash equilibrium that’s not Pareto-dominated by another Nash equilibrium. Yeah, the paper says that if you have any such Nash equilibrium, it has to be symmetric, so it gives both players the same payoff.

Daniel Filan: Got you. So this is an interesting paper. In some ways it’s interesting that it somehow gets you the good qualities you wanted out of program equilibrium by getting these nice cooperative outcomes in symmetric games while providing less information. And at least in this restricted setting, you get less information than the full program and you get better results. Yeah. I wonder if this is just one of these things where more options in game theory can hurt you. Because on some level it’s a little bit surprising, right?

Caspar Oesterheld: Yeah, I think it is an illustration of that. By being able to fully observe each other, you get all of these different equilibria including weird, asymmetric bad ones. So being able to fully observe each other’s source code, yeah, in some sense it makes things worse because there’s much more to choose from. Whereas under certain conditions, I mean, as far as our paper goes, you avoid this problem if you have the more limited option of just accessing how similar you are to the opponent.

Are similarity-based cooperators overly cliqueish?

Daniel Filan: Yeah. So speaking of restrictions on the difference function, one thing that struck me is that in real life, “cooperate with people who are very similar to you and otherwise defect” is not an outcome that we aspire to. And I’m wondering: does it work - it seems like you want something where people cooperate just as long as they’re just similar enough to cooperate, even if they disagree about what’s fair in some sub-game that we’re not actually going to get into or something. You want minimal agreement to be able to get cooperation. And I am wondering: how does this fit in with this setting?

Caspar Oesterheld: Yeah, it’s a good question. I think an important thing to clarify about how this would work, I think the important thing is that to get this cooperative equilibrium, one really needs a signal of how similar one is with respect to playing this game that one is currently playing, or with respect to how cooperatively one approaches the game that one is currently playing.

And in particular, all other kinds of signals about similarity are completely useless, right? If we play a game and we get a signal about, I don’t know, whether we have the same hair color or something like that, that’s completely useless for how we should play the game, presumably, unless the game is about some specific thing that relates to our hair color. But that signal is useless. And also, probably if you get a super broad signal that just says, “Well, in general, you’re kind of pretty similar,” it’s probably not even sufficient to get the cooperative equilibria because… Okay, I mean, if the signal says you’re exact copies, then that’s sufficient. But if the signal says you’re 99% the same and there’s just 1% that you are kind of different, 1% of your source code is different or something like that, well, it might be that this 1% is exactly the part that matters, the part that decides whether to cooperate or defect in this game. So yeah, what really matters is this strategic similarity.

Daniel Filan: Yeah, I think there’s an in-between zone though, where suppose I say, “Hey, I’m going to cooperate with people who cooperate with me. But if we don’t reach a cooperative equilibrium, I’m going to defect in this one way.” Suppose you say, “Oh, I’m going to cooperate with people who are willing to cooperate with me. But if we don’t manage to cooperate, I’m going to have this different method of dealing with the breakdown case.” Now intuitively, you’d hope that there’d be some good similarity metric that we could observe where this counts as similar and we end up cooperating somehow. I’m wondering does that happen in this formalism?

Caspar Oesterheld: Yeah. Okay, I mean, our formalism generally, it doesn’t necessarily restrict the diff functions that much. I mean, it definitely allows diff functions that only depend on what you do against similar players. So the kind of diff function that you’re describing is: we say that players are similar if they do similar things, when they observe that they’re facing a similar opponent, and otherwise we regard them as different. And I think that would be sufficient for getting cooperative equilibria. In some sense, I think that’s kind of the minimum signal that you need.

Daniel Filan: Yeah. I guess in some sense the question is: can we come up with a minimally informative signal that still yields maximal cooperation or something?

Caspar Oesterheld: Yeah.

Sensitivity to noise

Daniel Filan: All right. So I now have, I guess, some questions about the details of the paper. So one thing that kind of surprised me is: I think in proposition 2 of the paper you’re looking at these policies where if our similarity is under this threshold, cooperate, otherwise defect. And the similarity measure agents observe is the absolute value of the difference between the thresholds plus some zero mean random noise (or let’s say Gaussian). And in proposition 2 it says that if the noise is mean zero, even if the standard deviation is super tiny, if I read it correctly, it says that the policies are defecting against each other at least half the time.

Caspar Oesterheld: Yeah. And that’s under particular payoffs of the Prisoner’s Dilemma.

Daniel Filan: That strikes me as rough. That seems pretty… I would’ve imagined that I could have been able to do better there, especially with arbitrarily tiny noise, right?

Caspar Oesterheld: Yeah, generally the way noise affects what equilibria there are is kind of counterintuitive in the paper or in the setting that we consider. So I mean, there’s also another result that’s surprisingly positive. So if you have noise that’s uniform between zero and some number, X, then in this setting where… Each player submits a threshold, like cooperate below, defect above, and then the diff that they observe is the difference plus noise, let’s say, uniform from zero to some number, X.

Regardless of X… You might think like, okay, with higher X means more noise, right? So if there’s higher X, you might think the equilibrium must get worse, right? It’s like at some high enough X, it kind of stops working. But it turns out that it basically doesn’t matter, it’s just completely scale invariant. Even if the noise is uniform from 0 to a 100,000 there’s still a fully cooperative equilibrium.

Daniel Filan: A fully what? A fully cooperative equilibrium?

Caspar Oesterheld: Yeah, a fully cooperative equilibrium. So that’s one where both players cooperate with probability 1 for the diff that is in fact observed.

Training neural nets to do similarity-based cooperation

Daniel Filan: Interesting. So one thing that the paper sort of reminded me of is… or it seemed vaguely reminiscent to me of this algorithm called LOLA, or Learning with Opponent-Learning Awareness, where basically when you’re learning to play a game, you don’t only think about how your action gets you reward, but you also think about how your action changes how the opponent learns, which later changes how much reward you get. And the reason that this matters is that you have some experiments of actually just doing similarity-based cooperation or training in a method that’s inspired by this with neural networks. So I think the thing you do is you study alternate best response learning, which if I understand correctly is, you train one network to respond well to the other network, then you train that network to respond well to the first network and you just keep on doing this. And basically you find something like: you do your similarity-based cooperation thing and it ends up working well, roughly. Is that a fair summary of what happens?

Caspar Oesterheld: Yeah, though it’s a bit more complicated. Okay, so we have this setting where each player submits, let’s say, a neural net, and then they observe how similar they are to each other and then they play the game, and then let’s grant that this setting has a cooperative equilibrium where they cooperate if they’re similar and defect the more dissimilar they are.

So there’s this problem still of finding the cooperative equilibrium. So yeah, you don’t know what exactly the neural nets are. This is some complex setting where they also need… Cooperation isn’t just pressing the cooperate button, it’s computing some function, and defecting is also some other function. So you need to do some ML to even find strategies. Even defecting is not so easy. You have to compute some function.

So what we do is indeed this alternating best response training, which is exactly what you described. The thing though is that if one just initializes the nets randomly and then one does the alternating best response training, then one converges to the defect-defect equilibrium. The reason for that, I think… I mean, one never knows, right, with ML. But I think the reason is probably just that this defect-defect equilibrium is much easier to find. And there’s this bootstrapping problem in finding the cooperative equilibrium. The reason to learn this similarity-based cooperation scheme is that the other player also plays the similarity based cooperation scheme and you want to be similar to them, right? But if they’re now randomly initialized, you just want to defect.

Daniel Filan: Basically you just need to observe some similarity in order to exploit similarity, and by default you just never observe it. Is that roughly right?

Caspar Oesterheld: Well, actually, if you initialize two neural nets randomly, right, if they’re large and so on, right, they’ll actually be pretty similar to each other, because they just compute some statistical average. The issue is just that it doesn’t pay off to become more similar to the other net, because the other net just does random stuff, right? So if you want to do well against the random neural network that doesn’t use the similarity value and just does random nonsense, the best thing to do is just to defect.

Daniel Filan: So you have to be similar and reward similarity.

Caspar Oesterheld: Yeah, exactly. So you have to be similar or… I mean, I guess the most important part is that to learn to do the scheme, the other person basically has to already have implemented the scheme to some extent. If they’re just doing something else, if they always cooperate or always defect or just do some random nonsense, then there’s no reason to adopt the scheme.

It still doesn’t hurt to adopt the scheme. If you adopt the scheme, they’ll be dissimilar, and so you’ll defect against them. So it’s just as good as defecting, but it’s a complicated scheme, right? You have to learn how to exactly decrease your amount of cooperation with how dissimilar they are, and you need to learn how to cooperate, which in this setting that we study experimentally is actually hard. So they need to set up this complicated structure and there’s no pressure towards having this complicated structure if the opponent is just random. So there’s no pressure ever towards having this complicated structure.

And I should say that this is very normal in other settings as well. So for example, it’s also not so easy to get learners to learn to play tit-for-tat. The reason for that is kind of similar: that in the beginning, if your opponent is randomly initialized, mostly you just learn to defect. And if you both start learning to defect, you never learn that you should cooperate and do this tit for tat thing or whatever, because your opponent is just defecting, right? So to get to the better equilibrium of both playing tit for tat or something like that, you somehow need to coordinate to switch from both always defecting to both always doing this tit for tat, which doesn’t randomly happen.

Daniel Filan: It’s almost reminiscent of the problem of babbling equilibria, right? So, for listeners who might not know: suppose you’ve got some communication game where agents want to communicate things to each other, there’s this problem where initially I can just talk nonsense and it means nothing, and you can just ignore what I’m saying. And that’s an equilibrium because if you’re not listening, why should I bother to say anything other than nonsense? And if I’m saying nonsense, why should you listen to it? Is that exactly the same or am I just drawing loose associations?

Caspar Oesterheld: No, I think it’s basically the same problem. I think in all of these cases, I think fundamentally the problem is that there’s some kind of cooperative structure that only pays off if the other player also has the cooperative structure. And so as long as neither player has the cooperative structure, there’s never any pressure to get the cooperative structure: the communication protocol or the tit-for-tat or the “cooperate against similar opponent”, all of these schemes, there’s just no pressure towards adopting them as long as the other player hasn’t adopted them. And so you just get stuck in just doing the naive thing.

Daniel Filan: So, in your paper, you have a pre-training method to address this, right?

Caspar Oesterheld: Yeah, so we have a very simple pre-training method. Basically it’s just: explicitly train your neural nets to basically cooperate against copies, which if you consider more general games, it’s just train them to maximize the payoff that they get if they’re faced with a copy while taking the gradient through both copies. So it’s like you play the game fully cooperatively in some sense.

And you also train them to do well against randomly-generated opponents, which if you have some prisoner’s dilemma-like game, where just basically it just needs to defect against, especially against dissimilar opponents, that’s the pre-training method. And basically the result of the pre-training method is that they very roughly do the intuitive thing, that they cooperate at low levels of difference or high levels of similarity, and the more different they are from their opponent, the more they defect.

So it does this intuitive thing, but it doesn’t do it in a… in some sense it’s unprincipled. In this pre-training process, there’s never any explicit reasoning about how to make something an equilibrium or how to make it stable or something like that. It’s just naively implementing some way of implementing this kind of function.

So that’s the pre-training. And then we do this alternating best response training where then we take two different models that are independently pre-trained in this way and then we face them off against each other. And they typically start out cooperating against each other because they are actually quite similar after applying this pre-training, and then - maybe this is surprising, I don’t know how surprising it is - but the more interesting thing I think is that if you then train them with alternating best response training, they converge to something that’s at least somewhat cooperative. So basically they find a cooperative equilibrium.

Daniel Filan: Yeah. And this is kind of surprising. So, you might naively think… Well, alternating best response is usually, you hold your opponent fixed and then you’re like, “What can I do that’s best for me?” And you might think, what can I do that’s best for me is just ‘defect’, right?

Caspar Oesterheld: Yeah, though I mean it is the case that, if your opponent cooperates against similar opponents and defects against dissimilar opponents, it’s clear that there’s some pressure for us becoming a copy of them, or becoming very similar to them.

Daniel Filan: And you are also taking that gradient?

Caspar Oesterheld: Yeah, one definitely has to take that gradient, otherwise one just learns to defect. I guess the reason why it’s not obvious that this works is just that the pre-training, it’s so naive. It’s such a simple method. It’s much simpler than this opponent-shaping stuff that you described where you kind of think about, “Okay, I have to make it so that my opponent’s gradient is such and such to make sure that I incentivize them to be a copy of me rather than something else.” One doesn’t do any of that. One just does this very simple crude thing.

And so I think what this does reasonably well to demonstrate is that it’s not that hard to find these cooperative equilibria with relatively crude methods.

Daniel Filan: Although I think you said that they didn’t cooperate with each other all the time?

Caspar Oesterheld: Yeah. So the cooperation does unfortunately somewhat evaporate throughout this alternating best response training. So they might initially be almost fully cooperative with each other and then you train them to become best response to each other and then they actually learn to be a bit less cooperative.

Daniel Filan: Okay. So, if the only issue was finding good pre-training or finding a good initialization, and then they have this pressure to become more like “cooperate with things similar to me”, why wouldn’t they cooperate more rather than less over time?

Caspar Oesterheld: That’s a good question. So if one had an optimal pre-training scheme that actually finds a correct way of doing the similarity-based cooperation scheme, so one finds a strategy that’s actually an equilibrium against itself, for example, and then one trains - and let’s say both players do this and maybe they don’t find exactly the same equilibrium, but they find some such strategy, and then we train the opponent to be a best response to your policy. Then what they should learn is just to become an exact copy.

And then once you’re there, you don’t continue. You stop. You’re done. You can’t improve your payoff anymore. Okay, gradients: if you still take gradient steps, it gets complicated, but if you think of it as really just trying to improve your neural net, you can’t improve it anymore if you’re in that equilibrium.

Yeah. So if you had this optimal pre-training scheme then alternating best response training would in some sense immediately, on the first step, it would just get stuck in the correct equilibrium.

So why doesn’t this happen? There are some reasons, multiple reasons I think. So one is just that our initialization just isn’t so good, so I kind of doubt that they are equilibria against each other. I think we tried this at some point to just see what happens if you just pair them up against the literal copy and they also unlearn to cooperate a bit because they just don’t implement the kind of correct curve of defecting more as they become more dissimilar.

Daniel Filan: Right. So, if they just don’t implement the correct algorithm, you just don’t have that pressure to remain being similar and reward similarity?

Caspar Oesterheld: Yeah, you have some complicated pressure. I guess you still don’t want to be completely different, but there’s all this trickery where you’re kind of similar in some ways, but I don’t know, your curve goes down a bit more and you exploit that their curve doesn’t really go down enough as the diff increases. You still have some pressure towards becoming similar, but it’s not enough and it’s not exact.

And then I think the other issue is that the alternating best response training… in some sense it can make them more cooperative. So this isn’t quite true because it can make them more cooperative by making them more similar. So if I have my optimally pre-trained network and then you train your thing to be a best response to mine, then it becomes more cooperative towards mine and mine will become more cooperative towards it.

But if you look at any individual network, it can’t become more… Once it only, let’s say, cooperates with 60% probability against exact copies, once it cooperates only with 60% probability or it’s not exactly cooperative anymore at a diff value of 0, there’s no way to get back from this to cooperating with a 100% chance as long as your opponent isn’t still cooperating at a 100%.

So there’s a way in which… if you imagine this alternating best response training as somewhat noisy, sometimes it by accident makes people defect a bit more, or something like that. Or sometimes it is just better to defect a bit more because the incentive curves aren’t optimal and don’t exactly make it an equilibrium to be a copy. As soon as you lose a bit, you’re never going to get it back. So as a consequence, usually during the alternating best response training, the loss just goes up and then at some point it kind of stagnates at some value.

Daniel Filan: So, naively, I would’ve thought: if at a 100% cooperativity… if when you cooperate with really close copies all the time, at that point it’s worth me becoming a little bit more similar to you. If I nudge your number down to 99%, I’m surprised that it’s so unstable. Or is the idea that it’s only stable in some zone and alternating best response can get outside of that zone? It is just weird to me that I can have an incentive to become more similar to you at one point in your parameter space, but if you deviate from that, that incentive goes the other way. I would expect some sort of gradual transition.

Caspar Oesterheld: Okay. So I mean in general it’s definitely not exactly clear what happens exactly during the alternating best response training, why it finds these partially cooperative equilibria when originally there aren’t any. Why is that? It’s not exactly clear why that’s the case. I mean, I think the reason why originally they’re not cooperative is that… a typical thing is that just their curve is too flat in the beginning in some sense. So they cooperate if they observe a diff value between 0 and 0.1 and they cooperate roughly equally much. And then if you’re an exact copy of this, you would want to defect more just so much that the diff value increases from 0 to 0.1. Okay. It’s not exactly like there’s noise, it’s a bit more complicated, but roughly it’s something like this.

Daniel Filan: You don’t want to be an exact copy at the very least.

Caspar Oesterheld: Yeah, you don’t want to be an exact copy. I mean, if the noise is uniform from 0 to 0.1, then maybe you do want to be an exact copy. But yeah, it’s somewhat complicated. I think that’s the typical way in which they kind of fail in the beginning, is just that they have these too flat curves.

I think another thing is also that they just aren’t close enough copies in the beginning. Typically they become closer copies throughout training, I think, if I remember correctly. Then it’s unclear why, at least not obvious why the alternating best response training causes the curves to be so that it is an equilibrium. I think part of it is just that they learned to defect maximally much or something like that, as much as they can get away with. Yeah, it’s not…at least I’m not aware of a super simple analysis of why the alternating best response training then does find these cooperative equilibria.

Daniel Filan: Okay. So, speaking of things which aren’t super simple, I think in this paper in one of the appendices, you do try this fancier method where instead of just doing alternate best response, you try to shape your opponent to your own benefit. And naively, I might think: ah, this is one of my favorite ways of training agents to play games, and you are trying to shape the opponent to make your life better, so I imagine things will work out better for these agents if you do this, right? But does that happen?

Caspar Oesterheld: Yeah, unfortunately it doesn’t seem to work very well. So that’s this LOLA method that you also talked about earlier. I also went into this with this kind of hope, thinking… And in some sense it is supposed to solve this exact kind of problem: it was developed, for example, to learn tit-for-tat in the prisoner’s dilemma. But somehow, yes, we tried a bunch to get it to work and we couldn’t really. I don’t know. There are some results where it kind of works for a bit and then it unlearns to cooperate again and it seems relatively unstable.

Daniel Filan: Is there any simple story of what’s going on here? Or is it maybe just weird hyperparameter tuning or weird nonsense of strange nets?

Caspar Oesterheld: So definitely LOLA is very sensitive to hyperparameters. So that’s kind of known, that if you take any positive LOLA results and you change the hyperparameters a bit, it’s a pretty good chance that it stops working relatively quickly. But I don’t have a good intuition for why it doesn’t work in this case or even in other cases. I don’t have an intuition for why it’s so sensitive to the hyperparameters and things like that and, I don’t know, why doesn’t it always work pretty straightforwardly?

FOCAL, Caspar’s research lab

Daniel Filan: Fair enough. I might move to some closing questions if that’s okay with you. So, first of all: you are the, I think, assistant director or co-director of this FOCAL lab at CMU, right?

Caspar Oesterheld: Mm-hmm (affirmative).

Daniel Filan: Can you tell us a little bit about that?

Caspar Oesterheld: Yeah, so that’s the Foundations of Cooperative AI Lab at Carnegie Mellon University. So the actual director is Vincent Conitzer, who’s also my PhD advisor while I’m still in the final stages, I hope, of my PhD.

So generally it’s a lab that’s supposed to work on the kinds of topics that we’ve discussed today. As one might imagine from the name it’s part of the CS department. But yeah, we have some more philosophical work as well, for example. Currently we have I think one postdoc and five PhD students

Daniel Filan: Cool.

Caspar Oesterheld: And I think listeners of this podcast who are interested in these kind of topics would be a good fit for this kind of lab. So I don’t know if anyone’s considering starting a PhD [but] I think it might make sense to check it out.

Daniel Filan: Okay. If someone is in that situation, what do they do in order to get into FOCAL?

Caspar Oesterheld: I mean, a lot of the application processes I think are not that different from general CS PhD application stuff. It’s good to have a paper or something like that. Another strategy is to try to work with us before applying. For example, at least in the past few years, I mentored summer research fellows at the Center on Long-term Risk and also at CERI, the Cambridge Existential Risk Initiative. So I guess that’s a way to work with me at least before applying for a PhD, which I think helps, well, if you want to just start working on some of these topics, but maybe also helps with getting in.

Daniel Filan: Sure. So, before we wrap up the show as a whole, is there anything that you wish that I’d asked that I hadn’t?

How the papers all relate

Caspar Oesterheld: Okay, so I guess one kind of question that I expected was that both the ‘Bounded rational inductive agents’ paper and the ‘Similarity-based cooperation’ paper, they touch on this kind of decision theory like Newcomb’s problem, evidential versus causal decision theory, this whole cluster of topics. And so I was expecting to get the question [of] how these relate, which, yeah, I guess I could say some stuff about. Maybe it’s just an excuse to talk even more about various interesting topics.

Daniel Filan: I think listeners will be glad for an excuse to hear you talk more about interesting topics. Yeah, how do they relate?

Caspar Oesterheld: So both of the papers are very explicitly inspired by thinking about these kinds of things. So yeah, I think that one should cooperate in a prisoner’s dilemma against a copy, for example. And I think it’s kind of unfortunate that there isn’t that much of a theoretical foundation for why one should do this in terms of learning, for example; regret minimizers have to learn not to do this, for example.

And so part of the motivation behind ‘Bounded rational inductive agents’ is to describe a theory that very explicitly allows cooperating against copies as a rational thing to do. So that’s somewhat inspired by this.

And then of course with the ‘Similarity-based cooperation’ paper, in some sense it’s even more explicit that it’s supposed to be doing that, though it takes this program equilibrium-inspired outside perspective where one doesn’t think about what it is rational to do in the prisoner’s dilemma against a copy. One thinks about what kind of program it is good to submit in this kind of setting.

And so in some sense one has this question that we can ask from the inside, like: if you are a neural net that was built or learned or whatever to play games well and you find yourself in this scenario - from the inside, from the perspective of this neural net - it’s like you face an exact copy and maybe you reason about things by saying, “okay, if I cooperate, then the opponent will cooperate.”

Daniel Filan: And both this and the ‘Safe pareto improvements’ paper have this quality of: from the outside, you’re making some change to the agent that’s actually making the decisions. And you might think that, shouldn’t this all just happen internally?

Caspar Oesterheld: Yeah, it is interesting that both of these papers take this outside perspective and make… I agree, one would think that one doesn’t need the outside perspective, but at least conceptually, it seems sometimes easier to reason from that outside perspective.

So in particular this program equilibrium framework, in some sense, you can think about this in the general… You can think of program equilibrium or this framework as asking the question, “How should you reason if other people can read your mind and you can read their mind?” And this is really a very hard philosophical question.

And you can avoid all of these questions by taking this outside perspective where you submit a program that gets to read the other program’s mind and then you treat this outside perspective just in the normal standard game-theoretic way. I don’t know, it’s a surprisingly… I don’t know, maybe the other problem is surprisingly hard and so this trick is surprisingly successful.

Daniel Filan: Yeah. So, I guess that’s one interesting relationship between BRIAs and the similarity-based cooperation: the internal perspective versus the external perspective.

Caspar Oesterheld: Yeah. Yeah.

Daniel Filan: And I guess there’s also this thing where with similarity-based cooperation, you’re saying: well, if there’s this difference function, then what happens? Whereas in the BRIA thing, you have a variety of these experts or these hypotheses. And I guess in some sense you’re looking at a wider variety of… Maybe the analogy is you’re looking at a wider variety of these similarity functions as well. You’re somehow more powerful, I think.

Caspar Oesterheld: Or just more generally looking at strategic interactions in a less constrained way.

Relationship to functional decision theory

Daniel Filan: Yeah. There are some interesting questions with these papers, in particular with ‘Safe pareto improvements’. I think relatedly, one thing I didn’t quite ask, but maybe I’ll bring up here: is this just a roundabout way of getting to one of these “functional decision theories” where you just choose to be the type of agent that’s the best type of agent to be across possible ways the world could be?

Caspar Oesterheld: Yeah, maybe that’s the case. I think the trickiness is that the functional decision theory, updateless decision theory, these sorts of things… they’re not fully specified, especially in these multi-agent scenarios. It’s just unclear what they’re supposed to do. I suppose as a functional decision theorist or updateless decision theorist, one might argue that one shouldn’t need surrogate goals, because in some sense the goal of all of these theories is to do away with pre-commitment and things like that, and you should come with all the necessary pre-commitments built in.

And so maybe in some sense, an idealized functional decision theory agent should already have automatically these, well, surrogate goals, except you wouldn’t call them surrogate goals. You just have these commitments to treat other things in the same way as you would treat threats against the original goal to deflect threats.

Daniel Filan: You just reliably do whatever you wish you had pre-committed to do, and hopefully there’s one unique thing that’s that.

Caspar Oesterheld: Yeah, in the face of equilibrium selection, it’s very unclear what that is supposed to come out as.

Following Caspar’s research

Daniel Filan: Yeah. Coming up on the end, suppose somebody’s listened to this, they’re interested, they want to learn more: if they want to follow your research or your output and stuff, how should they do that?

Caspar Oesterheld: There are three things. So I recently made an account on the social media platform X, formerly known as Twitter, and that is @C_Oesterheld, and I mostly plan to use this for work-related stuff. I don’t plan to have, I don’t know, random takes on US elections or whatever.

Then I also have a blog at casperoesterheld.com, which is also mostly sticking relatively closely to my research interests. And then if you don’t want to deal with all this social media stuff, you can also just follow me on Google Scholar and then you just get just the papers.

Daniel Filan: All right. We’ll have links to all of those in the description of the episode. Yeah, it’s been really nice talking. Thanks so much for taking the time to be on AXRP.

Caspar Oesterheld: Thanks for having me.

Daniel Filan: And for listeners, I hope this was a valuable episode.

Daniel Filan: This episode is edited by Jack Garrett, and Amber Dawn Ace helped with the transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev, Ben Weinstein-Raun, and Tor Barstad. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at [email protected].

Discuss

Watermarking considered overrated?

Published on July 31, 2023 9:36 PM GMT

Status: a slightly-edited copy-paste of a ~~Twitter~~ X thread I quickly dashed off a week or so ago.

Here's a thought I'm playing with that I'd like feedback on: I think watermarking large language models is probably overrated. Most of the time, I think what you want to know is "is this text endorsed by the person who purportedly authored it", which can be checked with digital signatures. Another big concern is that people are able to cheat on essays. This is sad. But what do we give up by having watermarking?

Well, as far as I can tell, if you give people access to model internals - certainly weights, certainly logprobs, but maybe even last-layer activations if they have enough - they can bypass the watermarking scheme. This is even sadder - it means you have to strictly limit the set of people who are able to do certain kinds of research that could be pretty useful for safety. In my mind, that makes it not worth the benefit.

What could I be missing here?

Maybe we can make watermarking compatible with releasing model info, e.g. by baking it into the weights?
Maybe the info I want to be available is inherently dangerous, by e.g. allowing people to fine-tune scary models?
Maybe I'm missing some important reasons we care about watermarking, that make the cost-benefit analysis look better? E.g. avoiding a situations where AIs become really good at manipulation, so good that you don't want to inadvertently read AI-generated text, but we don't notice until too late?

Anyway there's a good shot I don't know what I'm missing, so let me know if you know what it is.

Postscript: Someone has pointed me to this paper that purports to bake a watermark into the weights. I can't figure out how it works (at least not at twitter-compatible speeds), but if it does, I think that would alleviate my concerns.

Discuss

AXRP Episode 24 - Superalignment with Jan Leike

Published on July 27, 2023 4:00 AM GMT

Google Podcasts link

Recently, OpenAI made a splash by announcing a new “Superalignment” team. Lead by Jan Leike and Ilya Sutskever, the team would consist of top researchers, attempting to solve alignment for superintelligent AIs in four years by figuring out how to build a trustworthy human-level AI alignment researcher, and then using it to solve the rest of the problem. But what does this plan actually involve? In this episode, I talk to Jan Leike about the plan and the challenges it faces.

Topics we discuss:

Daniel Filan: Hello, everybody. In this episode, I’ll be speaking with Jan Leike. After working for four years at DeepMind on reinforcement learning from human feedback and recursive reward modelling, in early 2021 Jan joined OpenAI, where he now co-leads the recently-announced superalignment team. For links to what we’re discussing, you can check the description of this episode, and you can read the transcript at axrp.net. Welcome to AXRP.

Jan Leike: Thanks a lot for having me.

The superalignment team

Daniel Filan: Yeah, not at all. So, first of all, I guess we’re going to be talking about this announcement of the superalignment team. For people who somehow haven’t heard of that or haven’t read that blog post, can you recap what it is and what it’s going to be doing?

Jan Leike: Yeah, I’m excited to. So, basically, we want to set us an ambitious goal of solving alignment of superintelligence within the next four years. So, by mid-2027. And Ilya Sutskever, the co-founder and chief scientist of OpenAI is joining the team. He’s co-leading it with me. And OpenAI is committing 20% of the compute secured so far to this effort, or to the effort of aligning superintelligence. And so, we’re staffing up the effort a lot. We are hiring a lot of people. In particular, we’re interested in hiring machine learning researchers and engineers who haven’t really worked that much on alignment before, because we think there’s a lot of scope for them to contribute and have a really big impact. Yeah, we have a general overall plan of how we want to approach the problem that involves training a roughly human-level alignment researcher that can work automatically, and then ask that automated alignment researcher to figure out how to align superintelligence.

Daniel Filan: Okay.

Jan Leike: And so, one of the key pieces for us to do would be to figure out how to align this automated alignment researcher.

What’s a human-level automated alignment researcher?

Daniel Filan: Okay. Yeah. I’d actually like to get into this. I think in the blog post, you used the phrase “human-level automated alignment researcher”, right? What should I imagine here? What is that?

Jan Leike: Yeah. So basically, we want to offload as many of the tasks that we do when we’re doing alignment work to an automated system [as possible]. So typically, when you’re using LLMs or if you’re building an AI system in general, the skill profiles they have isn’t exactly what a human would do. Right? They would be vastly better at some things, like language models are now on translation or knowing facts and so on. And then the AI system would be significantly worse at some other tasks, like language models are right now with, for example, arithmetic. And so, the question then becomes: what are the kind of tasks that we can offload to the AI systems, in which order? And as we are doing this, you’d expect humans would focus more and more on the tasks that we are not offloading. And so, as we go into that process, AI systems are doing a larger and larger chunk of the overall work, and human researchers will basically thus be more and more effective at actually making progress.

Daniel Filan: Okay. So, should I imagine something like, instead of you replace the first OpenAI alignment team employee, and then the second one, [we] should imagine you replace this type of task that everyone is doing, and then this type of task that everyone is doing, roughly that kind of thing?

Jan Leike: Yeah. That’s how I picture it going. And then, I think in order to actually get a lot of work out of the system, right, you would want to have 99 or 99.9% of the tasks being automated, because then you have effectively 10x, 100x, 1,000x as much research output.

Daniel Filan: Okay. What kinds of tasks are you imagining it doing?

Jan Leike: So broadly, I would throw them into two different buckets. One bucket is the tasks that look more like traditional ML engineering research that you would do if you were just trying to make AI systems be more capable. And then, the other bucket is all the other things that we have to do on alignment. And so, in the first bucket, this is stuff like you’re implementing ML experiments, running them and looking at the results. And the second bucket, it’s more like how do you, for example, figure out what experiments you should run to improve scalable oversight, or how do you make progress in interpretability, right? These are really big, high level questions.

But there’s also just a lot more detailed questions. [E.g. say] you have a given point that you are in research - let’s say, you have just written a paper, and you’re like, “okay, what do we need to do next if we continued down this route?” And so, I expect that, basically, ML in general will get really good at the first bucket of just designing, running experiments automatically. And our job of differentially accelerating alignment progress would be to figure out how to automate the second bucket.

Daniel Filan: Okay. And so you’re conceiving the second bucket as the full stack, from coming up with research directions to coming up with ideas of what things might work to all the way down to “what script do I run right now?”

Jan Leike: Yeah. I mean, you could ask me: if I think that alignment research is so similar to machine learning research, how much is there really in the second bucket? But I think there’s actually a lot in there and it’s highly leveraged, because alignment as a problem is still so vague and confusing. And I think, in general, there’s a lot of disagreement among experts around the most promising directions or what we should do next. And so, the more you can accelerate what we do there, it will actually have really large impact.

Daniel Filan: Okay, cool.

Jan Leike: This is basically the same pitch that you would give for a researcher to join the field, right?

Daniel Filan: Yeah.

Jan Leike: We’re still trying to figure out the basics. It’s a wide open research problem. We don’t know how to align superintelligence or even systems that are significantly smarter than humans.

Daniel Filan: It makes sense. Yeah, it’s like we want to recruit AI, just like we want to recruit more people, I guess.

Jan Leike: That’s right.

Daniel Filan: All right.

Jan Leike: But there’s something really beautiful about recruiting AI, which is it scales so much better and faster than humans do, because all you need to do is buy more GPUs and then you have more AI.

The gap between human-level automated alignment researchers and superintelligence

Daniel Filan: Makes sense. So one question I had is: when you said a human-level alignment researcher, it seems often in AI, most things aren’t exactly human-level at anything, right? So you mentioned chat models. I think they’re superhuman just in terms of breadth of knowledge, right? I think it would be hard for anyone to know as many facts as GPT-4 does. But [they’re] subhuman at arithmetic, at least if the human’s allowed to have pen and paper, you know? So, how important is the ‘human-level’ qualifier? On these lists of tasks, if it’s really superhuman at some of them, is that a problem for you or is that just so much the better?

Jan Leike: Yeah, I think the question is really how risky is it to run that system on the task of alignment research? Because if it knows a lot of facts, that isn’t particularly scary, but what we really need to figure out is if we let the system take over some amount or ultimately almost all of our alignment research, will it lie to us? Will it try to deceive us? Will it try to take the opportunity to take over? Because now, it’s doing so much stuff that we can’t look at [it all] ourselves. And so, the question is the kind of skillset that you would need to do this, how does it compare to the kind of skillset that we would need to get a lot of assistance in alignment research?

And if you zoom into that question, what are actually the things that we would be worried about? This is like, how good is it? Is the model spinning really coherent lies or being deceptive or pretending to do something or believe one thing and then actually wanting another? I think another really key capability here is self-exfiltration. So, how good would the model be at breaking the security precautions and accessing its own weights and trying to copy it somewhere else on the internet? Or persuading an engineer with access to the weights to download them and send them somewhere? And so, we can specifically measure how good the models are at that and then we can compare it to measuring and how good is it at actually helping us with alignment research?

Daniel Filan: Okay. And so roughly the idea is: you want the models to not be too good at these scary tasks?

Jan Leike: That’s right.

Daniel Filan: Yeah. So, this actually relates to a critique of this line of research that basically says: okay, if I want a human-level automated alignment researcher, it needs to be pretty smart, it needs to be creative, right, it needs to think of things that we haven’t thought of yet, it needs to be able to plan towards a goal - I want to get this, so I’ve got to do these non-obvious things in the way, I’ve got to learn things about the world… And it’s also got to be really good at thinking about misalignment, right, in order to solve misalignment problems. And so one might think, “oh, that combination of things, that’s inherently scary or dangerous.” And I guess the question almost is, then: if the task is you’re building something, you’re aligning this automated alignment researcher, do you even have any problems left for it to solve?

Jan Leike: Yeah. I think, ultimately, this is an empirical question. It’s really difficult to know in which order which skills get unlocked when you scale up the models. There’s a lot more work aimed at predicting emerging capabilities now and I’m really excited about that. I think it’ll give us some chance of actually predicting what the next pre-trained model will be like. But I think we can make some high-level arguments. So for example, one thing that is pretty clear is once you have the model [so] good you can hand a lot of alignment research off to [it], wouldn’t it then also be able to just improve its own capabilities? Or it can do a bunch of ML research, so it can run experiments on improving compute efficiency or something. And then, you could use that and pre-train a much more capable model shortly thereafter.

And I think the story sounds, on the surface, appealing. But I think in practice it will be actually a lot more complicated, because you’re not doing big pre-trained runs every week. They usually take a few months, and so it would be a few months before you actually have that. In the meantime, you still get to use the system. I think the other thing is also: there’s kind of an open question of how much low-hanging fruit there still is on actually having compute efficiency wins. And I think, ultimately, the argument here I would make is that, right now, the existing community of people who are trying really hard to get AI to go faster and be more capable is already quite large relative to the alignment community. And so, if you get to automate a lot of these tasks and both communities benefit equally, then I think actually alignment benefits a lot more because it’s a smaller community, and so we don’t have to do these tasks anymore.

Daniel Filan: Sure. So, I took the plan to be something like: we’re going to make this automated alignment researcher and then it’s going to help us with alignment. And given that in order to make an automated alignment researcher, it needs quite a lot of pretty smart, pretty good capabilities, what problems does it need to solve that we wouldn’t have needed to solve in order to get it?

Jan Leike: I’m not sure I understand your question. Maybe I can answer the second part of the previous question that you asked.

Daniel Filan: Sure.

Jan Leike: Which is, what about long run goals? What about creativity? It seems like, to me, at least, that language models or AI in general has proven to be, on average, more creative than humans, I would say. If you look at, I don’t know, the diffusion model images or if you sample from a pre-trained base model or something, there’s a lot of really wild stuff in there and there’s a lot of creativity that I think you would really struggle to get out of a single human or a small group of humans, just because the model has seen the whole range of everything humans have said or all the images on the internet. And so, they can sample actually from the whole distribution. Whereas, individual humans typically can’t.

And then, in terms of long-run goals, I think this is actually not needed at all, because we can hand off pretty small well-scoped tasks to AI systems, that if they really nail those, it would actually be really useful, right? And that could be really small range things, like “here’s the paper that we just wrote, please suggest some next steps or some new experiments to do”. And if you imagine having a really A-star researcher that you can ask these questions, they don’t have to pursue long-run goals, right? They only have to optimize over the next, I don’t know, few thousand tokens. And if they do that super well, then you would get a lot of value out of them.

Daniel Filan: I guess. That seems like it’s in tension with this aim to automate 99.9% of alignment research, right? Because I would think that thinking “what things do we need in order to get an aligned AI?”… I would think that’s a significant chunk of the difficulty of doing alignment research.

Jan Leike: That’s right. So what I wanted to say is the system’s adding a lot of value by really excelling in these tasks, right? And then, what you do is you have a whole portfolio of these tasks. Some tasks will be like “write the code that implements these experiments”, and then there’s another task that’s like “look at the results and tell me what you see, or suggest what to do next”. And now, once you have done these tasks, you compose them using some general recipe, as people do with Auto-GPT or language model programs, where each of the tasks is small and self-contained, and so the system doesn’t need to pursue long-run goals.

Or for example, the recent work that came out from OpenAI on using process-based feedback on math where instead of training on “did the system get the right solution?” and just doing RL on that, you train a reward model from human feedback on every step in the proof. And it turns out, that’s actually much more effective because it gives the AI system a much more fine-grained way to learn and more detailed feedback. Now, is that going to be competitive with doing end-to-end RL on did you get the solution right in the long run? That is very unclear. But at the very least, you can use this kind of broken-down step-by-step setup, get the system to do a lot of really useful things that humans would’ve done, and then piece it together.

Daniel Filan: Yeah. Although even the small tasks… so one of the small tasks you mentioned was “look at results and decide what to do next”. I guess I would’ve thought that in order to do that you have to have the big picture in mind and think “what next project is most useful for my goal of solving superalignment in four years?”, right?

Jan Leike: That’s right. But you wouldn’t do this in the sense that you’re trying to optimize and do credit assignment for four years. It’s probably more like, well, you’re just adding some broader goals and context into your prompt. But when you’re actually doing reinforcement learning or reinforcement learning from human feedback to improve the system, then you don’t have to wait until that research project concludes to decide whether or not it’s good. It’s just, you use the human as reward shaping; you’re just like “does this look like a good direction that seems better than any directions I could have thought of?” or something. And so, I think the overall goal here is not to build the most capable automated alignment researcher that we possibly could with the tech that we have, [but] rather build something that is really, really useful that we can scale up a lot, and most importantly, that we trust is aligned enough to hand off these tasks to.

And so, if we introduce a fair amount of inefficiencies in these processes where we’re essentially sandbagging the model capabilities a whole bunch by training it this way and we are like, “well, we’re giving it these broken-down tasks and now it would execute those, but if we train it end-to-end, it would be more capable” or something. I don’t think that matters as much. So this is typically what people call the alignment tax. And the alignment tax matters a lot if you’re, let’s say, competing in the market with other companies: I’m building a chatbot or something and my chatbot is more aligned but it seems a lot less capable. That would have a hard time competing in the market. But if you have an automated alignment researcher, the automated alignment researcher doesn’t have to compete in the market, it just has to be useful to us. And so we can get away with paying a higher tax because we just don’t have a replacement, or the real replacement is hiring more humans, which doesn’t scale that well.

What does it do?

Daniel Filan: Okay. I guess another way to ask the question I was going to ask previously is: what problems will you want this automated alignment researcher to solve?

Jan Leike: I mean, ultimately it should solve the problem of “how do we align superintelligence?”

Daniel Filan: Okay. So is the idea that we solve the problems of how to align something roughly human-level and then there’s additional problems as it gets smarter and it just starts taking on those?

Jan Leike: Yeah, basically. So I imagine that an actual solution to aligning superintelligence that we and lots of other people actually truly believe in will look quite different from what we do today. If you look at how ChatGPT is aligned today, it’s a lot of reinforcement learning from human feedback. And there’s a widely-shared belief, that I share, that that just won’t scale because it fundamentally assumes that humans really understand what the system is doing in detail. And if the system is doing a lot of alignment research where you’re thinking of millions of virtual human equivalents or something, there’s no way you’ll be able to look at all of it and give detailed feedback. And it’s a really difficult task and you’ll miss lots of important bugs. But the kind of techniques we’re looking at right now to scale this and align a roughly human-level alignment researcher that can do these difficult tasks but won’t do it crazily differently than humans would, are steps or continuations of that, where scalable oversight, for example, is a natural continuation of reinforcement from human feedback.

Daniel Filan: What is that?

Jan Leike: Scalable oversight?

Daniel Filan: Yeah.

Jan Leike: So scalable oversight I would define as generally a portfolio of ideas and techniques that allow us to leverage AI to assist human evaluation on difficult tasks.

Daniel Filan: Okay. So scalable oversight is an example of the thing that you could build off of reinforcement learning from human feedback, often called RLHF.

Jan Leike: Yeah. And so typical examples of scalable oversight are debate, recursive reward modeling, iterated distillation and amplification, automated market making, and so on. And there’s a bunch of ideas floating around. But what I actually wanted to say, to get back to your original question, is I think if we are actually aligning superintelligence, which is this system that is vastly smarter than humans, can think a lot faster and run at a much larger scale, that just introduces a whole lot of other problems, especially because it will be super general, can do lots of tasks, and then you have to figure out how to align it not just on the much narrower distribution of alignment research-y tasks, but everything else. And also, you want to have a much higher degree of confidence that you’ve actually succeeded than you would get with, let’s say, a bunch of empirical evaluations.

And so I don’t really know what that would look like and I don’t think anyone does, but I think it would be really exciting if we can have some formal verification in there. Maybe we’ve figured out some kind of learning algorithm that has theoretical guarantees. I don’t know what even would be possible here and theoretically feasible if you have a lot of cognitive labor that you can throw at the problem. But all of these things are very different from the kind of things that we would do right now or that we would do next. And I also don’t think that a roughly human-level alignment researcher would start working at those problems right away. And instead what they would want to do is, we would want them to figure out how to better align the next iteration of itself that then can work on this problem with even more brain power and then make more progress and tackle a wider range of approaches. And so you kind of bootstrap your way up to eventually having a system that can do very different research that will then allow us to align superintelligence.

Daniel Filan: Gotcha. So once you have these human-level AI alignment researchers running around, do you still have a job? Does OpenAI still have a human superalignment team?

Jan Leike: Yeah, good question. I mean, I would be excited to be replaced by AI to be honest. But historically what typically happens is what we mentioned earlier: the AI assistant does 99% or 99.9% and then we just do the rest of the work. And I think also something I’m really bullish on is making sure that humans stay in the loop or stay in control over what AI systems are actually doing, even in the long run when we’re long past really being able to understand everything they’re doing. And so there’ll be some humans that still have to try to understand what the high level is of what the AI is trying to do. So that wouldn’t necessarily have to be the superalignment team that is at OpenAI now. It might also require a very different skillset than we have right now. But I think ultimately humans should always stay in the loop somehow.

Recursive self-improvement

Daniel Filan: Okay, gotcha. So yeah, actually speaking of… I guess the loop? Wow, that’s a bad segue, but I’m going with it. So one thing that OpenAI has mentioned in a few blog posts is, so firstly there’s this idea that safety is actually pretty linked to capabilities. You need smart models to figure out the problems of alignment. Another thing is that there’s this desire to avoid fast takeoff. And I think there’s this quote from ‘Planning for AGI and beyond’ that says “it’s possible that AGI capable enough to accelerate its own progress could cause major changes to happen surprisingly quickly”. And then it says “we think a slower takeoff is easier to make safe”. So one thing I wonder is, if we make this really smart or human-level alignment researcher [so] that we then effectively 10x or a 100x or something the size of the alignment team, does that end up playing into this recursive self-improvement loop?

Jan Leike: I mean, it really has to, right? You can’t have a recursive self-improvement loop without also improving your alignment a lot. I mean, I personally think fast takeoff is reasonably likely and we should definitely be prepared for it to happen. And if it doesn’t happen, I’m happy about that.

Daniel Filan: How fast are we talking?

Jan Leike: I mean, ultimately, I don’t know. But you can draw some parallels to some other machine learning projects like AlphaGo or Dota or StarCraft where the system really improved a lot week over week. Yeah, there’s obviously a lot of uncertainty over what exactly happens, but I think we should definitely plan for that possibility. And if that does happen, a really good way to try to keep up with it is to have your automated alignment researchers that can actually do thousands of years of equivalent work within every week. And there’s just no way that that’s going to happen with humans.

How to make the AI AI alignment researcher

Daniel Filan: Okay. Now that we have a better sense of what we’re looking at with a human-level automated alignment researcher, can you give a sense of what the plan is to make one?

Jan Leike: Yeah, so I mean, you basically need two parts. One is you need a system that is smart enough to do it, and then the second part is you need to align it to actually do it. And I think they’re not two separate parts, I think they’re very intimately linked. And I’m personally not working on the first one, I think that will definitely happen and a lot of people are working hard at making it happen. And there’s lots of different ways that could go, but you could just picture training a much larger model and eventually it will just be smart enough. And on the second part, that’s the part I’m really interested in. And so the question is: now I have this really smart pre-trained model, how do you get it to actually do alignment research in the way that you would want it to? Or more importantly or more critically, how do you know whether you can trust it enough to do that?

Because if you’re handing this kind of task off to AI, there’s this opportunity that a smart AI system could just try to figure out how to present results and techniques to you in a way that if you then decided to implement them would actually empower the system itself or some other related system and give it more opportunity to have more power-seeking behavior in the future, or let’s say self-exfiltrate, for example. And so what we need to do is, one, actually align it to do the task and leverage its capabilities to do it basically as well as it could. So that is what I would call, we need a bunch of alignment training methods. And then we also need to be able to distinguish a model that is actually aligned on the task and tries to do alignment research really well and actually help us and be honest and be truth-seeking, and a system that pretends to be that to us but actually wants to take over the world or (say) self-exfiltrate or do something else. And these are two different systems, one is basically the truly aligned one and one is the deceptively misaligned one.

And an important part of our job is figure out how to distinguish these two systems. And that will require a bunch of methods that I would call validation methods. So these are methods that we wouldn’t really train on in a sense that the AI systems wouldn’t be directly selected on passing the validation tests, and more they will give us some kind of independent signal of “which one of these two versions is it?” And so examples here are interpretability research, for example. If I can look inside the model and find some evidence that it’s secretly plotting against us, that will fail the validation test. Another technique that could either be training or validation, depending on how it’s set up, is this easy-to-hard generalization problem. Can you understand and improve the model’s ability to generalize from easy problems that we can supervise to harder problems that we struggle to supervise? And if you can do that, now you can just supervise the model on easy parts and then you crosscheck on the difficult ones where you would worry that the model might secretly deceive you or try to do something else.

Scalable oversight

Daniel Filan: Okay, cool. So yeah, there’s a few things there. I guess I’d like to talk about the scalable oversight part first. How are we going to do this? Especially, as background, I think we live in a world where there’s not a lot of consensus on what good alignment research is. So how are we going to get a training signal for what counts as good alignment research?

Jan Leike: Yeah, exactly. And I think that the reason that there isn’t a consensus is good evidence that the problem is actually hard. But also it tells us that, well, the field is still immature and we haven’t gotten that much empirical evidence so far. But I think there’s some really important property that alignment research has that we can leverage for scalable oversight, which is that I think it’s fundamentally easier to evaluate alignment research than it is to do it. And [that] doesn’t mean it’s easy to evaluate, but it is much easier to find, for example, a paper that has a cool idea and does some cool experiments and gets some good results and you read that and you’re like, “Oh, actually this is a really good idea, I think this is good,” than it would be to produce that work in the first place.

And so leveraging this principle that evaluation is easier than generation is the core of a lot of scalable oversight ideas. And so for example, if you’re thinking about recursive reward modeling, where the basic idea is you have some kind of AI system that you use as an assistant to help you evaluate some other AI system. So now, because evaluation is easier than generation, the task that the assistant has to do is a simpler task, especially because you are working together with it. So actually aligning the assistant on the simpler task of the evaluation assistant, where let’s say the claim is that that would be easier, and if you succeed at that task, then now you use your human/assistant conjunction to supervise a new system on an even harder task. And so if you keep doing that, you’re unlocking a broader and broader scope of tasks that you can effectively supervise your AI systems on.

Daniel Filan: So somehow the idea is that we’re iteratively adding more and more AI knowledge basically to the evaluation portion of AI alignment research. And by doing it this iterative way, the idea is that we can consistently give it a good training signal?

Jan Leike: Yeah.

Daniel Filan: Okay.

Jan Leike: Let me make this a little bit more concrete. So for example, RLHF is the simplest one where you don’t use any assistant. Your AI system does something, a human looks at it and says whether it’s good, and that’s your training signal.

And now, the next step up from that is, basically, you train the simplest assistant model, which is just a critique model. You have a separate language model that looks at what the first AI system did, and then it writes a critique. So let’s say the first AI system writes some piece of code, you look at the code: humans are notoriously bad at finding bugs in code, that’s why there’s so much buggy code out there in the world, but now, if you have a critique system that writes a critique and it points out the bug, it’s so much easier for the human to say, “Oh yeah, this is totally a bug. We should fix that”.

And it’s also not a super crisp task, necessarily, because usually, the code is written for some natural language specification. What is actually meant by that is somewhat vague, and actually, whether something is a bug, can be ambiguous. But the important point here is that, because you’re using the critique model as an assistant, and if [you have] the critique model, now, you can supervise a wider range of tasks, because essentially, you could find all the problems in code that your AI system would find or would’ve written a critique, which might not be exactly the same as all the code where it knows there’s a bug in there, but you have expanded the range of things that you can supervise.

The really nice thing is that, I think, actually, there’s a bunch of ways in which we can actually empirically measure how well this works. There’s one way, which is kind of what we did in the critiques paper that we published last year, where you basically do randomized controlled trials with targeted perturbations. So you take, let’s say, a coding task. In the paper, we didn’t do a coding task, but the task doesn’t matter here. And then, you have a response. The response does not even have to be perfect, but let’s just assume it is. What you do is you introduce some subtle bug in there, some subtle flaw that would make it a bad response, but you want it to be hard to find for humans.

Daniel Filan: How do you introduce this bug?

Jan Leike: The simplest thing is just you ask humans to do it. But ultimately, you want to do it with the model, and I’ll get back to that. Now, you have this paired response. You have the original response, and you have the response that you know is worse because you made it worse. And now, you draw one of them at random and you show it to a human or a human assisted with AI or whatever your scalable oversight method is. And then, you see how likely are they to find it? Or you just ask them, “Is there a bug here?” And now, you can measure, if you have a better way to assist them, you can see how often they find the bug.

But you can do even more, which is, you can directly compare this to how much does the model know that there’s a problem here by training it, essentially, to be a discriminator between the good version and the flawed version. And if you look at the accuracy for that, that’s the discriminator performance. And then you check that with… if you ask the model or the RLHF version of the model to write a critique of the code, how often does it actually write about the flaw? Now, you get this critique accuracy equivalent, and now you can compare these two numbers. That’s what we call the discriminator-critique gap.

And if you have a large discriminator-critique gap, that means that there would be a lot of problems that the model can smell or knows about or whatever metaphor you want to use, but it’s not telling us because if we asked it to write a critique, it doesn’t say it. And a large discriminator-critique gap would be very worrying because it means that your scalable oversight is somehow very flawed and there’s a lot of things that slip [under] your radar, or the assistant is, in a sense, not very aligned because it’s not trying very hard to help you.

And if the discriminator-critique gap is pretty small, that is somewhat reassuring because it means, if the model knew there was a problem there, it would tell you. So in particular, if a similar sized model or the same base model was trying to plant a problem, or it was trying to, let’s say, put Trojans in your code that is hard for you to find, but if you did run the code, the Trojans might fire, and then it could take over or do something else. Then, presumably, it would know that these Trojans could be there because it planted [them] itself, right? So if we have this other copy of the model, then, if we measure a small discriminator-critique gap, we’d have more confidence that this can’t be the case because the other copy of the model would presumably then write a critique.

Daniel Filan: Yeah.

Jan Leike: And this kind of hinges on the assumption that discriminator training actually can take out of the model this kind of thing. And I think there’s good reason to believe that because, if you’re fine-tuning for a discriminator, essentially, you have a way to access all of the activations that are everywhere in the model because you can fine-tune all of the weights simultaneously. So if the knowledge is somewhere in the model, in its model’s activations, then a discriminator fine-tuning on, if you do it right and you have enough data, it should be able to extract it.

And in fact, you can give some back-of-the-envelope calculations of how much data you would need. And if you want to identify one neuron that… This is not how it works, but hypothetically, if there was one single neuron that knew if the code is bad, you would need the log of the number of total neurons data points to narrow it down and find it, which is not that many.

Daniel Filan: Yeah. I guess this actually gets to a question I have about the whole scheme. The reason we didn’t like RLHF is that there was this concern that it wasn’t going to distinguish between things that were right and things that sounded good to the human, because the human was bad at critiquing.

And with this recursive critique thing with training a discriminator and then a critiquer, I wonder if we still have that fixed point. So one world is, we have a thing, we train it to write actually-good critiques, and then, we join ourselves with it to train a thing that writes even better critiques. But I worry, maybe there’s another fixed point where we train a thing and we accidentally train it to write pseudo-critiques that sound good to us and we join with it, and then, we’re even better at looking for critiques that sound good to us. And then, we trained this discriminator to discriminate between things that don’t have problems and things that seem like problems to us. I’m wondering, yeah, how worried should we be about things that seem good to us rather than things that are actually problems being a fixed point here?

Jan Leike: Yeah, I think that’s a really important, valid concern to have. And in a way, that’s one of the key things we need to figure out about scalable oversight. This is the reason why I’m excited to actually implement these techniques and try them out empirically because I think we will just be able to measure how much this effect shows up in practice.

And one concrete, simple way we can measure it is, we actually just deliberately train a deceptively-aligned model, and then, we see if the scalable oversight flags it, or how hard is it to… or how much effort do we have to put into making the model deceptively aligned such that it would pass our scalable oversight.

Daniel Filan: Yeah, I guess there’s also these sandwiching experiments where you have an AI that’s trying to answer legal questions. There’s some people who don’t know much law, and then, you get some people who are really good at law and hopefully, they can check if the AI has fooled the people.

Jan Leike: Yeah. I think these experiments are interesting, and there’s some interesting work on this already. I think there are some fundamental problems to this setup.

One is, well, you really have to trust your experts, and the experts are going to be wrong sometimes. And then you have to have sufficient domain overlap. We did a little bit of preliminary experiments with this where we had non-coders who were supposed to use assistance to figure out problems in code, and they didn’t even know how to ask the right questions. So you have to have some overlap.

And then, the other problem is also that, I think the task that I’m fundamentally really interested in, we just don’t have ground truth on. And even if we, let’s say… alignment research, we’re doing alignment research now, we can be like, “Oh, we’ll just get some alignment researchers to label some tasks.” And then I’d be like, “Well, what if they are wrong? We don’t agree about so many things. That seems hard.” And then, also, there’s not that many and their time is very expensive, so that would be extremely expensive data.

So in general, I would want to have an evaluation method that doesn’t rely on the assumption that we have ground truth. And this is why I’m excited about the prospect of using these randomized controlled trials with targeted perturbations or measuring the discriminator-critique gap. Because you can do this even if you don’t have any ground truth, and the task can be arbitrarily hard.

Daniel Filan: Yeah. Although, even in those - measuring the discriminator-critique gap, it’s still got to be the case that you have an actual discriminator versus a discriminator of things that seem like problems versus things that don’t seem like problems, right?

Jan Leike: You mean you can have AI systems introduce the flaws, right? And in a way, that is also much better than humans doing it because it’ll be a more honest distribution for actual[ly] what AI systems would do. And if you’re fine-tuning your discriminator on that data, you actually do have ground truth if you can trust that the flawed version is actually worse.

Daniel Filan: And I guess it’s-

Jan Leike: And the way you could trust that is, well, you look at the reason why it’s worse and you can verify that reason, which is a lot easier.

Daniel Filan: Yeah, I guess there’s some hope that, even if the AI can make you think things are good that aren’t actually good, maybe if the AI makes you think something’s bad, it is actually bad? Or that it’s degraded the performance, that if the AI makes you think it’s degraded performance, maybe it’s easier to check that it’s degraded performance?

Jan Leike: Yeah, I see your point. I probably shouldn’t actually use the word ground truth in that setting because it is not truly ground truth, just as nothing really is truly ground truth, but there’s a variety of things that you can do to make you very confident in that in a way that doesn’t necessarily make the task of finding the problem easier.

Searching for bad behaviors and internals

Daniel Filan: Okay. Gotcha. I think I want to talk a little bit about the next stage of the plan. Searching for bad behavior and bad internals, I think, was how it was put in the superalignment post. What open problems do you see here that you’d want the superalignment team to be able to solve?

Jan Leike: Yeah. The obvious candidate here is interpretability, right? In a way, interpretability is really hard. And I think, right now, we don’t really have any slam dunk results on language models where we could say interpretability has really given us a lot of insight or added a lot of value. And this is because we’re still pretty early in understanding the models and what’s going on inside.

Daniel Filan: So there are some results… people do some interpretability things to language models. There’s this induction heads work, there’s this indirect object identification work for a circuit that does at least a certain type of indirect object identification. I’m wondering, what would you need beyond those to get what you would consider a slam dunk?

Jan Leike: Yeah, sorry, I don’t mean to diminish existing work here, and I think–

Daniel Filan: You can score in basketball without getting a slam dunk, right?

Jan Leike: Yeah. And I think very cool things have happened, but I think something that I would consider a really cool result is, you use your interpretability techniques on a language model reward model like a GPT-4 size or whatever your largest model is, and then you say something about the reward model that we didn’t know before. And that would be really cool because the reward model is what provides the training signal for a lot of the RLHF training, so understanding it better is super valuable. And if you can flag or find problems of behavior that it incentivizes that you wouldn’t want, that would be really cool.

I think that is doable. And I think, importantly… In that sense, I think interpretability is neither necessary, nor sufficient. I think there is a good chance that we could solve alignment purely behaviorally without actually understanding the models internally. And I think, also, it’s not sufficient where: if you solved interpretability, I don’t really have a good story of how that would solve superintelligence alignment, but I also think that any amount of non-trivial insight we can gain from interpretability will be super useful or could potentially be super useful because it gives us an avenue of attack.

And if you think about it, I think it’s actually crazy not to try to do interpretability because, in a way, you have this artificial brain and we have the actually perfect brain scanner, right? We can zoom in completely and fully, accurately measure the activation of every single neuron on each forward pass, so in arbitrary, discrete timestamps. That’s the maximal resolution that you could possibly want to get. And we can also make arbitrary interventions. We can perturb any value in the model arbitrarily how we want. That gives us so much scope and opportunity to really do experiments and look at what’s happening. That would be crazy not to leverage.

But at the same time, the reason why it’s very hard is because the model is learning how to do computations tailored to efficiency, and it’s not regularized to be human-understandable, or there’s no reason to believe that individual neurons should correspond to concepts or anything near what humans think they are or should be or what are familiar to us. And in fact, empirically, neural networks represent many different concepts with single neurons and every concept is distributed across different neurons. So the neurons aren’t really what matters here.

But one path… I think there’s two things on interpretability that I’m super excited about. One is the causal version of it. You want to not only look at a neuron as you’re passing data through the model and say, “Oh, this fires when we have stories about Canada,” or something. This is one of the things that we found in the interpretability paper. We had this ‘Canada’ neuron that would fire at Canada-related concepts. But it’s only correlationally. You observe a correlation between Canada-related concepts and that neuron firing, but it’s not a causal relationship. In order to check that it’s a causal relationship, you have to deliberately write text that has Canada-related concepts, see if they all fire, but also put in other related concepts that maybe sound like they could have something to with Canada or don’t have anything to do with Canada, but are similar, and then check that the neuron doesn’t fire. Or you take a text, and then edit it and check that the neuron turns off, and things like that.

Daniel Filan: Yeah. I guess this is reminiscent of this, I think it was called the interpretability illusions paper, where they noticed, on Wikipedia, you can have neurons that fire on one specific thing, but then, on other data sets, that was just an illusion, it fires on a bunch of other stuff.

Jan Leike: Yeah. And I think the other thing that is really exciting is, the work that we’ve started last year where we released a paper earlier this year on automated interpretability. And here, the idea is, basically, what you would want is, you would want to have a technique that both can work at the level of detail of individual neurons so that you can really make sure you don’t miss any of the details, but also can work at the scale of the entire model. Because at the end of the day, everything works together and everything is highly connected in the model, so you need both. And so far, techniques have mostly worked on one or the other. There has been attempts at… there’s been work on automated interpretability before our paper, so it’s not like we were the first ones. But I think, in general, if you can have some really detail-oriented interpretability work, some kind of mechanistic interpretability approach that really tries to understand individual circuits or computation units inside the model, the way to then scale that to the entire model is, you need automation, right?

Daniel Filan: Yeah.

Jan Leike: But you can do that. Once you’ve figured out how to do it on the detail, then, you just record what you’re doing. I’m simplifying a bit too much here, but I think that’s the general scope that I would be excited about, also because it really fits with our automated alignment goals. We want to have our automated alignment or interpretability researcher really look in detail into the model, understand what’s going on. Then, we sift through the whole… everything, or we find ways to aggregate it.

So for the paper, we had a dump of explanations. The paper does, it writes natural language explanation for individual neurons, which is not quite the right thing to interpret, as I mentioned previously, but it gives you a simple example of what we could do here. The way it works is you just show a bunch of activation patterns to GPT-4 and you have GPT-4 write a proposed explanation.

And in general, these explanations aren’t very good, also, because the task is so hard and most neurons don’t do a very clearly human-understandable thing. But we can just run this on the scale of every neuron in GPT-2, and then just dump all the explanations, and then try to find what is interesting patterns. And you can look at scaling trends and be like, “How does automating scoring of these explanations scale as models get larger?” Or, “What if we add more compute or we make the model that’s doing the interpretation bigger? How does the quality of explanations change?”

And the really cool thing is that we can just measure this automatically, again, using language models. And it’s not a perfect metric and there’s a bunch of problems with it, but it gives you a proxy for “would a human think this explanation is good?” And then, you use this proxy at scale and just run it on a ton of neurons.

Daniel Filan: Yeah. So, if you think about the necessary interpretability work, how much of it do you think is just figuring out a better fundamental unit of interpretation versus figuring out how to scale what’s happening? Or other things you might…

Jan Leike: I think it needs to be both. I think it’s very reasonable to believe that the former is much harder, but I think the scaling part is needed for it to succeed.

Deliberately training misaligned models

Daniel Filan: Gotcha. Yeah, so, I guess part 3 of the plan was something like deliberately training misaligned models and seeing if the pipeline could detect those. Are you planning to weed them out or actively fix them?

Jan Leike: The goal here would not be to fix a deliberately trained misaligned model.

Daniel Filan: Okay. Just to detect?

Jan Leike: Yeah. Fundamentally, one core aspect of what we need to do here is, we need to be able to distinguish between the actual aligned alignment researcher that does what we want and that truly wants to help us make progress on alignment and the one that pretends to do that, but secretly wants to take over and self-exfiltrate. And one way we can figure out more about how good we are at distinguishing these two is, we make the second one and see if it looks like the first one to us. Obviously, you have to be very careful with doing that because you’re kind of making the scary thing that we’re all trying to avoid, right?

Daniel Filan: Yeah. I actually wanted to ask about that. I think at the start of the post, there’s one sentence about how great superintelligence could be, and then, it says, I think, that it “could lead to the disempowerment of humanity or even human extinction.” What are we talking about with misaligned models? How do we know they’re not going to do those awful things?

Jan Leike: Yeah. Let me put it this way. There’s a lot of things that are needed to make the future go well. We need to have the right governance structures so that we can prevent misuse of powerful AI systems, and we have to have precautions so we don’t run into some kind of emergent runaway effects that are still bad for humans where you have aligned AI, but AI is aligned with corporations who are not aligned with humans or something, and then, the whole system goes off the rails, but it looks good in all the economic metrics or something.

Also, we need to solve technical alignment so we can actually align AI systems. The risks that superalignment is looking at is just the last part. We want to figure out the technical problem of how to align one AI system to one set of human values. And there’s a separate question of what should those values be? And how do we design a process to import them from society?

And I think these are really important questions and we really have to answer them too for the future to go well. And I think, for the future to go well, the bar is not “AI doesn’t kill literally everyone”, but I want there to be lots and lots of human flourishing and I want to have a future that everyone is really excited to live in. And the more powerful AI gets, the higher the stakes will be, right? Because there’ll be more risk from misuse and if systems are misaligned, things can go wrong a lot more.

In particular, the threat models that I think we should be particularly worried about from a misalignment perspective are systematic lies and deception and self-exfiltration.

Daniel Filan: Okay. When you say, when you’re deliberately training misaligned models, is the idea that you would train them to lie and potentially self-exfiltrate, but at a low enough capability level that it’s fine or something?

Jan Leike: You wouldn’t want to train the model, actually, to self-exfiltrate because you know, what if it actually gets good at that, right? You would want to pick some other proxy task that is a lot more mundane, but that still tells you that… It makes it easy to evaluate whether or not the model is succeeding at it.

And then, the core difficulty and the core goal is really figuring out how hard it would be for a system to have the properties that it would need to still pass the evals. If you have a system that is lying to you or deceiving you, that’s really ineffective if it only does so sometimes, and then, if you ask it, “Oh, did you lie about this?”, and it says, “Yes,” it wouldn’t be a very coherent, believable liar. The scary systems are the really coherent and believable liars that lie super consistently where it would just be really hard to interrogate in a way that points out its lies or any sort of inconsistencies in what it’s saying.

And I think, actually, to be this kind of consistent liar is not that easy. And our whole goal with what we’re trying to do with scalable oversight and interpretability and generalization is that we want to make it really, really hard for the model to be this consistent liar. But if we just do that, we don’t know how hard it actually is.

And one way we can find out is, we train the model to be a really consistent liar, and then, we see how hard it is. How much do we have to cheat? How much do we have to give the model hints or handhold it along the way? But you don’t actually want to train a dangerous system that, if you did run it, it would successfully self-exfiltrate, because then, you’ve caused the very thing that you’re trying to prevent, and that would be really bad.

Daniel Filan: Yeah. So when you say that, are you distinguishing lying from other kinds of danger, such that we could train it on lying, but we’re not going to train it on murder, for instance?

Jan Leike: Yeah. Let me try to give you an example. Let’s say you’re giving it some fake goal. Maybe it is to persuade a human to say a certain catchphrase, or maybe it is to run a certain Python command that, in fact, doesn’t do anything but it looks interesting or something.

Daniel Filan: Yeah, yeah.

Jan Leike: And you can very well… it’s so easy to measure whether the system succeeds and you can set up the experiment so that it’s hard to do that a priori. And the goal here is not to have the system stress test, let’s say, our cybersecurity or something, although we should also do that separately. I think that’s another effort.

Here, the goal is really just, how close are the systems that we’re currently training, or that we currently have, to a system like that that would be deceptively aligned, that is a coherent liar, that any chance it gets where it thinks humans aren’t looking, will run specific code or things like that. That’s exactly what we are looking for. And a priori, it’s really hard to get confidence of how close we are, unless we just push it a little bit in that direction and see if that happens.

Daniel Filan: So the thing I’m imagining now is, you’re training it to do something kind of arbitrary other than the thing it’s supposed to be trained for such that it has to do deceptive stuff, it has to do the kinds of things it would need to do in order to cause damage, but at the end of the day, it’s just going to put a sticky note on someone’s forehead or something.

Jan Leike: Exactly, exactly. It’s like you’re hiring someone to test your physical security, and all I have to do is get in the building and give you a handshake and you’re like, “Yeah, it seems like you succeeded at it,” or something. Or you’re like, “Okay, can you steal this fake object for me because I want to figure out how good our security is,” things like that that you can do that doesn’t have actual consequences, but still tells you a lot about your security. I’m excited to do the same with alignment where you stress test the alignment systems you have by training something that’s particularly targeted to break and circumvent them, but if it succeeds at doing that, it doesn’t cause actual harm. It’s very benign.

Four year deadline

Daniel Filan: Okay. Cool. I’d like to ask an overall question. So in the superalignment blog post, in big letters, it says, “Our goal is to solve the core technical challenges of superintelligence alignment in the next four years.” What do you mean by the core technical challenges of superintelligence alignment?

Jan Leike: Yeah. This would be the general technical measures, how to align a superintelligence with a set of human values. And the kind of superintelligence we’re picturing here is a system that is vastly smarter than humans, that could possibly execute much faster. You can run it a lot in parallel, it could collaborate with many, many copies of itself, so a truly vastly powerful system.

And we want to do this in four years. And the reason why we picked the four years was, we want to have a really ambitious goal, something that feels like we can actually achieve it, and also something that we can still realistically use, even if AI progress actually happens really fast and the technology actually improves a lot over the next few years, so that we still have something that we can deliver.

Daniel Filan: Gotcha. And just to be clear, it sounds like this isn’t just building the human-level automated AI alignment researcher. It’s also using that to just get the technical problems of aligning something much smarter than us.

Jan Leike: That’s right. The roughly human-level automated alignment researcher is this instrumental goal that we are pursuing in order to figure out how to align superintelligence because we don’t yet know how to do that.

Daniel Filan: Gotcha. So if, four years from now, you want to have solved these core technical challenges, where do you want to be in two years that would represent being on par to meet that goal?

Jan Leike: Yeah. If you want to work back from the four years, I think, basically, probably, in three years you would want to be mostly done with your automated alignment researcher, assuming the capabilities are there. If they’re not there, then, our project might take longer, but for the best reasons.

And then, the year before that, we already want to… so this would be in two years, we’d want to have a good sense for what are actually the techniques that we could use to align the automated alignment researcher. Do we have the portfolio of techniques that, if we did apply them, we would actually feel confident that we have a system that we can trust and we can use a lot, and then we can hand off a lot of work to? And then, basically, at that point, we would have wanted to have broken down the problem enough so that it feels like now, the overwhelming amount of work is just engineering, which, in that sense, would leave us roughly two years to figure out the research problems attached to it.

Now, this is kind of a timeline for this goal that we set ourselves with four years, and obviously, there’s really important interactions here with how AI capability progress goes. Because if progress slows down a lot, then we might just not have a model that’s good at any of the really useful alignment research tasks. We’ve tried a whole bunch to do this with GPT-4, and GPT-4 was just not very useful. It’s just not smart enough of a model. That’s the short version of it, but I’m happy to elaborate. And so if four years from now we’re just like, “Well, we didn’t have a model that was good enough”, then also that means that we have more time to actually solve the problem because the problem isn’t as urgent. And on the other hand, it could also be that AI progress is a lot faster and we are like, “Well actually we don’t think we’ll have four years because superintelligence might arrive sooner.” That could also be possible. And then we have to adapt our plan accordingly. And so the four years was chosen as a timeframe that we can actually probably realistically plan for, but also where we have enough urgency to solve the problem quickly.

What if it takes longer?

Daniel Filan: Yeah. So a question I think a few people have wondered about, and I’m curious about as well is: so suppose that on the AI capabilities research front, it goes roughly as expected. Four years from now, you have all the capabilities to have something that could be a good automated alignment researcher, but interpretability turned out harder than we thought, or scalable oversight turned out harder than we thought, and you guys haven’t made the mark. What happens then?

Jan Leike: Well, we have to tell the public that we haven’t achieved our goal, right? So, we are accountable to that goal. But also what happens if we don’t make the goal depends a lot on what does the general state of the world look like. Can we get ourselves more time somehow or was our general approach misguided and we should pivot or…? A lot of things can happen.

Daniel Filan: But basically: let people know and then figure out what the good next thing to do is and do it, roughly?

Jan Leike: It’s a pretty high-level answer.

Daniel Filan: Yeah.

Jan Leike: But I think there’s something more here, which is: at least to me, it feels like the alignment problem is actually very tractable. I think there’s a lot of good ideas out there that we just have to try rigorously and measure results on and we will actually learn and be able to improve a lot. And I’ve significantly increased my optimism over the last two years that this is a very practical problem that we can just do. And I think even if I turn out to be wrong and it did turn out to be much harder than we thought, I think that would still be really useful and be evidence about the problem, and in particular I think right now there’s a lot of disagreement over how hard the problem actually is. And maybe more importantly, I think there is… One of the really important things we need to know, and be able to measure, is how aligned are our systems actually in practice?

And I think one of the things I worry about the most is not that our systems aren’t aligned enough, but that we don’t actually really know how aligned they are. And then experts will reasonably disagree about it because if everyone agrees the system isn’t aligned enough to deploy, it won’t get deployed. I think that’s a pretty easy case, and the kind of scary scenario is more like you have this capable system and you’re like, “Well, it’s probably fine. It’s probably aligned, but we are not quite sure”. And there’s a few experts who are still quite worried and then you’re like, “Well, we could deploy it now, but let’s not”. And at the same time there’s this strong commercial pressure where you’re like, “Well, if we deploy the system it would make a billion dollars per week” or something crazy. And you’re like, “Okay, there’s still–”

Daniel Filan: Should I take a billion dollars per week as OpenAI’s official projection?

Jan Leike: laughs

Daniel Filan: Okay, sorry. Anyway, so there might be pressure to deploy something.

Jan Leike: Exactly. Because you’re like, “Well, if we deploy it later, it’ll cost us effectively a billion dollars”. And then you’re like, “Okay, experts are still worried, so let’s make the call and hold off from deployment for a week”. Then a week goes by, another week goes by and people are like, “Well, what now?” And then alignment experts are like, “Well, we looked at some things but they were inconclusive, so we’re still worried and we want to delay it bit longer”. And I think this is the kind of scenario that I think is really worrying because you have this mounting commercial pressure on the one hand and you’re pretty sure, but not quite sure. And so that’s a scenario I would really love to avoid. And the straightforward way you avoid it is you just get really good at measuring how aligned the systems actually are.

Daniel Filan: Gotcha.

Jan Leike: And that’s where a broader portfolio of techniques is really useful.

Daniel Filan: Yeah, I guess it’s almost even worse than that in that if you’re in a situation where you could be using your AI for alignment research, then errors of not deploying are actually costly in terms of safety, potentially, it seems like.

Jan Leike: That’s right.

The superalignment team and…

… governance

Daniel Filan: So I actually wanted to pick up on this thing you mentioned about just getting a bunch of good measures. So a thing that I think a few OpenAI blog posts have talked about is this idea of audits of AI systems and, I don’t know, trying to figure out “are they going to be safe?” To what extent do you expect the superalignment team to work on things that would be useful for audits?

Jan Leike: I mean, I think I’m somewhat hopeful that some of the techniques that we’ll produce might be good for audits if we do our job well. For example, if we manage to make some progress on interpretability, then an auditor could use whatever techniques we come up with as part of their audit. Or I think making some kind of scalable oversight part of an audit is a really natural thing. But I think also to some extent the superalignment team is not ideally positioned to do an audit because we are not independent of OpenAI.

And I think an audit truly needs to be independent of the lab that is being audited. And so you want to have… and this is what makes me really excited about having independent auditors, because you want someone else to double-check what you’re doing. And I think more generally, our main responsibility is not to convince ourselves that the system we’re building is aligned and safe, because it’s so easy to convince yourself of all kinds of things that you’re incentivized to be convinced of, but really we have to convince the scientific community or the safety community that the system we are building is actually safe to run.

And that requires not only the research that leads to the techniques that we’ll be using and the empirical evidence that the system is as aligned as we think it is that we can then show to others, but also independent assessment of all of the above.

Daniel Filan: So would it be right to say, even just broadly about governance efforts, that you maybe think that ideally this team would produce techniques that are relevant to that effort, but independently people need to do that and you’re just going to focus on solving technical alignment?

Jan Leike: Yep. So this is another way you can put it right? We really want to focus on solving the problem and we want to make sure the problem is solved and that we can actually implement the solution on the cutting edge systems, but we still need independent assessment of “did we succeed at that?” or “how dangerous are the models’ capabilities?” and those kinds of audits. But to some extent we have to evaluate the alignment too. That’s the problem that we have to face, but specifically what we want to do is solve the problem.

… other OpenAI teams

Daniel Filan: Yep. Makes sense. I guess branching off from there, how do you see your team as relating to - OpenAI, I guess it has an alignment team, or at least it had an alignment team - is that still existing by the way?

Jan Leike: So the alignment team, as it existed last year, had two parts, and one of them was called ‘practical alignment’, one of them was called ‘scalable alignment’. Practical alignment’s mission was roughly aligning OpenAI’s most capable models. And so that team focused a lot on aligning GPT-4. And then the scalable alignment team’s goal was figuring out alignment problems that we don’t have yet. And so what’s happened with the ChatGPT launch and all the success is it became a big product and there was a lot of work needed to improve RLHF and improve the models, make it into a really nice product. And the alignment team was just never… That was never the place to do it.

And so what we used to call practical alignment work now is done at lots of other teams at OpenAI and has become essentially a really large project that involves probably more than a hundred people or even hundreds of people. And then what used to be scalable alignment is now basically what the superalignment team does. The reason why we chose this name superalignment is we wanted to really emphasize that we are trying to align superintelligence. We are doing research on a problem that we don’t have yet and we are doing the forward-looking work that we think will be needed in the future. Not because we wanted to say that the other work wouldn’t matter or something. I think it is really important, but because that is our focus now.

Daniel Filan: Gotcha. Yeah, that’s useful to know; or I guess I’m not going to make use of that knowledge, but that’s interesting to know. I guess I’m curious, how do you see the superalignment team as relating to other things at OpenAI? Like efforts to make ChatGPT nicer, minimize, I don’t know, slurs it says or something, like the governance team, just various other groups at OpenAI.

Jan Leike: Yeah, I mean part of the reason for being at OpenAI is also because it’s much easier to work with these other teams more closely and realistically we need a feedback signal of whether what we are doing is actually useful and helpful. So for example, you could say, “Well, we are not trying to solve today’s problems, but if we are doing our job well then what we are building should be useful for aligning, let’s say, GPT-5”. And so that would be the feedback mechanism: can we help make alignment of GPT-5 better with our techniques? And then in terms of governance, I mean there’s a lot of things that you could mean by AI governance, and one thing the governance team at OpenAI is working on is how do we evaluate the model’s dangerous capabilities? And that’s very related to our question of “how can we stress test our alignment assistants? What if we trained deceptively-aligned models?” and doing these kinds of evaluations.

… other labs

Daniel Filan: Gotcha. Yes. Actually, speaking of why you’re at OpenAI and relating to stuff at OpenAI, as I guess you’re aware there are other really good AI research labs out there, that are also working on things sort of related to alignment of superintelligence. I’m wondering how do you think your plan compares to that of things you see other people doing?

Jan Leike: Yeah, I think there’s a lot of related work going on, at DeepMind and Anthropic specifically, but also various other places. And to some extent we’re all trying to solve the same problem. And so it’s kind of natural that we end up working on similar things. There’s other work on interpretability and there’s other work on scalable oversight. And I think to some extent that is… you’re running the risk of duplicating a bunch of work and maybe it’d be good if we all try to coordinate better or collaborate more. But then on the other hand, it also avoids groupthink, because if every lab tries to figure out how to solve these problems for themselves, there’s a natural tendency that humans are more skeptical of what the other lab produces. But there’s also a flip side where you can get these ‘not invented here’ effects where people just don’t want to use the techniques that were invented somewhere else and you believe they’re bad or you have a bias towards believing they’re bad.

And so I wouldn’t say that we are in a good equilibrium right now, and I think there’s a case to be made that maybe all the alignment people should be in one place and work together somehow, but this is the world that we are in because essentially the cutting edge AI labs are incentivized to invest in alignment. And I think also this is something that’s become super clear with the success of RLHF, where it makes models a lot more commercially valuable and thus it makes it more attractive to invest in the kind of research that produces these kinds of techniques. And so if the AI labs are the main funders of this research, then it’s natural that that’s where it happens.

Daniel Filan: Sure. I guess in terms of research agendas or ways you’re setting forward, do you see anything sort of distinctive or unique about the OpenAI superalignment team approach?

Jan Leike: Yeah, so what we really focus on is trying to figure out how to align this automated alignment researcher. So we are not trying to figure out how to align on any kind of task. And we are less worried about alignment taxes as a result of that, at least on that particular question. And I don’t think other labs have emphasized that goal or direction in that way. I think also, I don’t know how much detail you want to go in here, but one thing that we are very bullish on is just trying all of the scalable alignment techniques and just seeing what works best and trying to find ways to empirically compare them. And I think other labs have specific scalable oversight techniques that they’re very bullish on and they try to get to work. And then also I think for interpretability specifically, we are taking this automated interpretability approach that I think other labs haven’t really emphasized as much, and we are really trying to lean heavily in there.

I think another thing that we really like to do is leverage compute to advance alignment, or that’s one of our main bets that we want to make. And so in particular, that could mean on scalable oversight, we really want to figure out how can we spend more compute to make a better oversight signal? What are the opportunities there? And there’s some obvious things you could do, but you do the best of them on your critique model, now you’re spending more compute, but you’re also getting better critiques. But really the question then is what other things can we do? How can we spend more compute to make the oversight signal stronger? Or, automated interpretability is a really easy way where you can just spend a lot of compute to make progress on the problem. I think the way we would do it now isn’t quite the right way to do it, but I think automated interpretability in general, if we can make it work, has this property and I think that’s why it’s exciting.

And automating alignment research, obviously; if you manage to do that, then you can just throw more compute at it and you’ll get more alignment out. very roughly speaking. But because we’ve kind of come to this conclusion that really what we want to do is turn compute into alignment, we come to the point where it’s like, “okay, now we need a lot of compute”; and this is the reason why OpenAI have made this commitment of 20% of the compute secured to date, towards alignment because basically it tells us, “Yes, there will be compute to do this” and if we actually figure out this automated alignment researcher, and it turns out we have to run it more, we’ll be able to use more compute to run it. But it means that the strategy of betting on turning compute into alignment can be successful and is supported by OpenAI.

Daniel Filan: Gotcha. Yeah, I guess thinking about that, I think that one difference I see in the OpenAI alignment team is it seems like OpenAI has written a lot about their thoughts about alignment. So the public deadline of four years, there’s also a bunch of posts like Our approach to AI alignment, Planning for AGI and beyond. I guess my question is, can you say a bit about what goes into that decision to be… It seems like you’re putting a lot of work into being very public about your thoughts.

Jan Leike: I mean, I think we need to. Ultimately we are all in the same boat on the question of is superintelligence aligned? And I think the public really deserves to know how well we’re doing and what our plan is. And a lot of people also want to know. And so because of that, I think it’s really important for us to be transparent about how we’re thinking about the problem and what we want to do. But on the flip side also, I really want to invite a lot of criticism of our plan.

Our plan is somewhat crazy in the sense that we want to use AI to solve the problem that we are creating by building AI, but I think it is actually the best plan that we have, and I think it’s a pretty good plan and I think it’s likely to succeed, but if it won’t succeed, I want to know why or I want to know the best reasons why it wouldn’t succeed. And in a way, I really want to invite everyone to criticize the plan or help us improve the plan. And I also want to give other people the opportunity to know what we are doing and if they think it’s a good idea, they can do it too.

Superalignment team logistics

Daniel Filan: Sure. Speaking of, so this new superalignment team, how big is it going to be in terms of people?

Jan Leike: Yeah, so we are about 20-ish people right now and we might be maybe 30 people by the end of the year. I think it’s not that likely that the team will be larger than a hundred people before the end of the four years. And really, the way the team size will scale is we’ll have millions of virtual people basically, or virtual equivalents of OpenAI employees or something. And so in that sense, we’ll scale massively.

Daniel Filan: All right, so there’s people and yeah, I guess as you mentioned, the other input is computation. So I think it’s, yeah, 20% of compute secured to date, why 20% as opposed to 5% or 80% or something?

Jan Leike: So we want to have a number that is large enough so that it’s clear that we are serious about investing in this effort and we want to allocate a serious amount of resources. 20% of OpenAI’s compute is not a small number at all. And I think it’s definitely the largest single investment in alignment that has been made today, but also it’s plausible that it is more than all the other investments combined. So it is pretty large in that sense. But also if we made it much larger, you can then question of, “Can OpenAI actually realistically do this”, right? If OpenAI still wants to develop frontier models and pre-train state-of-the-art AI systems, that needs a lot of compute.

Daniel Filan: Okay. And I guess I think it was mentioned in terms of “compute secured up to date”, I guess because you guys don’t necessarily know how much more you’re going to get, but are you imagining roughly keeping it at that proportion as you get new stuff in?

Jan Leike: I mean it depends how things go. So ‘compute secured to date’ means everything we have access to right now plus everything that we’ve put in purchase orders for, so it is actually really a lot. In terms of how much we’ll actually need, we don’t really know. Maybe we actually don’t need all of that. Maybe we need much less. Maybe we end up needing a lot more and then we have to go back and ask for more. But the thing is… in particular, we want to spend it wisely and not squander it because it’s a lot of resources. And right now we still have a lot of work to do to figure out how to spend it wisely. But I’m pretty optimistic that if we had a good way to spend it and we needed more, that we could get more.

Generalization

Daniel Filan: All right. Okay. So one thing that I think was mentioned in one of the footnotes in the blog post is: favorable assumptions that are made to date might break down. And one of them was about I think that generalization would be benign or something like that. Can you say how you’re potentially thinking differently about generalization?

Jan Leike: So the generalization effort is a team that we started recently and especially Collin Burns has been spearheading. And the question is basically, or the question as phrased now is: how can we understand and improve our model’s ability to generalize from easy tasks that we can supervise to harder tasks that we struggle to supervise? And so specifically you can think of this as being complementary to scalable oversight, where in scalable oversight you would be looking at empowering human evaluation of what your system is doing. Or if you’re thinking about recursive reward modeling, you’re like, “Can we recursively evaluate everything that AI is doing with AI assistants that we recursively evaluate?”

And one thing I really like about this is it puts the human really in the loop, front and center, and looking at everything the AI system is doing. Of course in practice you couldn’t literally do that because AI systems will just do a lot, but you can look at everything with some small independent probability. But then you’re still left with this question of: how do you know that the models generalize to the cases where you’re not looking at? And so typically the way I used to think about this in the past is that you just make sure that you have mostly IID generalization, where the tasks that you’re looking at are the same distribution as the tasks you’re not looking at.

Daniel Filan: Yeah. In fact, I think there was this blog post on your Substack that… I think it said something like you just weren’t going to rely on generalization at all and just keep on training, keep on being IID or something.

Jan Leike: Yep. So this was the original plan, or at least my original thinking was I wanted to really not have to rely on non-IID generalization, because in neural networks that doesn’t work so well and it’s poorly understood. But the new question is, “Well, what if we actually did understand it? What if we can actually say something meaningful about generalization?” And I think that’s a really good question. And I think also Ilya has been raising this question a lot as well. And so what we want to understand is, on the things that we are not supervising, even if they’re not IID, can we say something meaningful about how well the model generalizes? Does it generalize in the way of human intent or does it generalize in some other way that… stuff that looks good to a human but actually isn’t? And so I think actually we can really study this question empirically right now, with carefully crafted experiments.

And so what we have been looking at is trying to split data sets, existing data sets into easy problems and harder problems, where the easy problems are just defined as what a small model gets right. And then we are trying to understand or improve: how can we improve the accuracy of a large model on the whole dataset? So this is a really interesting topic because it’s kind of like it gives you a whole new pillar in the general arsenal of what I would call training and validation techniques. So let’s say you get that to work super well, now you can supervise your reward model on some easy tasks that you have a lot of confidence you can evaluate, and then you can get the model to generalize if you can solve this problem or you get the model to generalize to harder problems.

And then you have this reward model that is generalizing the way you wanted it to, to the harder tasks even if you’re not supervising. And now you could use that for training. And then you still have the problem of how do we know it’s actually aligned now? But you can still leverage scalable oversight interpretability on these other techniques for validation. Or reversely, let’s say we train our automated alignment researcher via scalable oversight and we use generalization as a validation technique where we generalize the probability of truthfully answering according to the model’s best knowledge.

And then we ask a question about Iis there a subtle flaw here? Is there some kind of Trojan in this code that a model that we aligned with scalable oversight wrote?” and so on. And so now there’s a really cool opportunity here also, which is we can do this high-level cross validation, but now we can train two different models. One is trained using the generalization techniques. One is trained using scalable oversight, and now we can have them crosscheck each other’s answers and be like, “Well, are these fundamentally different models or the same models? What are the important differences?”

Daniel Filan: Okay. Sorry, when you say “trained via the generalization technique”, is that just train on the easy problems and they turned out to generalize to hard problems, or…?

Jan Leike: So if you understand how your models generalize from easy to hard and you can make it generalize really well. By which I mean the accuracy is basically as good as if you had trained on the hard problems (that, by assumption, we don’t actually have ground truth [for]). Now you could use it as a reward model or you could use that as… it generalizes “which action or which answer would I prefer if I really knew what was going on here?”

Daniel Filan: Yeah, I wonder how… So when I think about interpretability, one frame I sometimes have for it is that it’s about this problem of non-IID generalization, right? Why do you want to know the internals of the model? Well, because you want to know what it’s going to do in cases you haven’t checked, right? So I’m wondering how do you think those two research agendas interact?

Jan Leike: In some ways they have this overlapping question they want to answer: what does the model do out of distribution? And they have two very different paths of answering them, at least it seems to me.

Daniel Filan: And you might hope you could cross-validate, for example.

Jan Leike: So for cross-validation you have to have a different, or some kind of different split of your training set, right? So what I mean with cross-validation here is you have one training run where you train using the generalization method and then you validate using interpretability and scalable oversight and other techniques. And then the second training run you train with scalable oversight and you validate with the generalization methods and interpretability and other methods. And so you have two independent attempts at the problem.

Daniel Filan: Yeah, I guess I meant cross-validation in the very loose sense of “things validate each other sort of in a cross”, yeah.

Jan Leike: But I mean, I think the best case would be that they actually complement [each other] more than they do the same thing. If you can understand or you can improve how the models are generalizing, then that gives you, in a sense, a way to leverage the model’s internals for whatever you’re trying to do in the best way. Or let’s say you’re trying to extract the model’s best beliefs about what’s true in the world. And that’s fundamentally difficult with RLHF, because RLHF reinforces what the human thinks is true: you rank things higher that sound true to you. And so you’re really training a model to tell you what you want to hear or what you believe, but it might not be what the model believes. But this generalization technique gives you a way to extract these… what is actually the model’s true best beliefs? If it works; we haven’t actually proven this out.

Whereas if you have really good interpretability tools, you could hope to do something similar where you would try to pinpoint models’ beliefs or internals or something from the internals. But I think it might be fundamentally harder because you never quite know: is this the best belief the model could produce or is it just the belief of somebody smart the model is modeling? Or if you have… There’s this hypothesis that pre-trained language models are just ensembles of different personas, and you might extract the beliefs of one persona or a bunch of personas.

Daniel Filan: Yeah. I guess you’re going to need some sort of causal modeling there from alleged beliefs to outputs, right, in order to have that really work?

Jan Leike: That’s right. You might need a lot more. And then on the flip side, I think in interpretability, this application is really natural. It’s like being a lie detector or finding evidence of deception or finding a secret plot to overthrow humanity inside the model. And that might be a lot harder to extract with generalization in the same way.

Daniel Filan: Yeah. I guess with generalization you have to pick the generalization distribution, and the hope is that maybe interpretability could tell you things about, it’s got some kernel of lying in there, but it only unlocks here, or it’s got no kernel of lying or something.

Jan Leike: Yeah.

Daniel Filan: Gotcha. Yeah, that makes sense.

Jan Leike: Yeah, and I think fundamentally this is also a really interesting machine learning problem: just how do neural networks actually generalize outside of the IID setting, and what are the mechanisms that work here? How has that come to pass? In which ways do they generalize naturally, in which ways they don’t? So for example, with the InstructGPT paper, one of the things that we found was, the model was really good at following instructions in languages other than English, even though our fine-tuning dataset was almost exclusively in English. And sometimes it would have these weird artifacts. You would ask it in a different language, let’s say German. I would ask it in German to write a summary, and it would write the summary, but it would do so in English. And the model generally totally understands which language it’s speaking, and it can understand all the languages, but it wouldn’t necessarily mean that it would have to follow the instruction in German. You could be like, “Well, you’re speaking in German, I don’t know what to do in German, so I could do some other thing”. But it fundamentally generalized following instructions across languages.

Daniel Filan: Yeah, yeah.

Jan Leike: But we don’t know why. This effect has been observed in other settings, and it’s not unique here. And it’s also, there is intuitive reasons why it would do that. Humans generalize across languages, but I would really love to know the mechanism inside the model that generalizes that, or generalizes to following instructions and code. But it doesn’t generalize in other ways.

For example, the way refusals generalize is often very different, where ChatGPT is trained to refuse certain tasks that we don’t want to serve, according to our content policy (if you, for example, ask for assistance in crimes or something). But then you can do these jailbreaks. And that’s really interesting, because basically there’s ways to trick the model. You can make it role play, or you say, “do anything now”, or there’s these really fun prompts on the internet, and then the model clearly complies and happily assists you in crimes, which it shouldn’t do. So it somehow didn’t generalize refusing the task to these other settings.

Daniel Filan: Yeah, yeah.

Jan Leike: And so why does it generalize in the first setting, in the first case, but it didn’t generalize here? I don’t fundamentally know an answer to this question. I don’t think anyone does. But it seems like a thing that’s really important to understand.

Daniel Filan: Yeah, yeah. That seems right. Cool.

So one question I have… so a recent guest I had on the podcast was Scott Aaronson and he mentioned that whenever he talks to Ilya Sutskever, apparently Ilya keeps on asking him to give a complexity-theoretic definition of love and goodness. How much will that be located within the superalignment team?

Jan Leike: So there’s a lot of different exploratory projects that we’ll probably do and try. And I think ultimately, there’s a question of can you summon, somehow (this is Ilya’s language), how can you summon the alignment-relevant concepts? One of the things you would want to summon is: does the model fundamentally want humanity [to] succeed? Or as Ilya would say, does it love humanity? And so, you can ask the question of how do you… if the model is really smart and it has read everything, it knows exactly how humans think about immorality… you can ask GPT4 to make different moral cases from different philosophical perspectives, about different scenarios. And it’s generally not bad at that.

So it fundamentally understands what humans mean with morality and how we think about stuff. So how do we get it to leverage that, actually? How do we extract it? How do we get it out of the model so we can then, let’s say, use it as a reward signal? Or use it as a thing that the model fundamentally believes or cares about? And so, I think that’s the core of the question.

Complementary research

Daniel Filan: Okay. So another question I have is, so you’re working on the superalignment team, but you guys can’t do everything. I’m wondering what kind of complementary research could other groups or teams do that would be really, really useful for the superalignment team?

Jan Leike: Yeah, there’s a lot of things here I’m really excited about. I think there’s this really important question of: how do we build fair and legitimate process for extracting values from society or basically eliciting values from society that we can align AI to? And I think there’s a lot of important open problems here that we need a lot of progress on. There’s a lot of questions around how do we make today’s models more aligned, solve hallucinations, solve jail breaking, try to improve monitoring – can you get GPT-4… if you try to jailbreak GPT-4, can the GPT-4 say “Oh yeah, I’m being jailbroken”? And can we generally build systems that really crack down on any misuse of the systems in the world? There’s a lot of questions that is related to that, that the AI ethics community is really interested in, of can we get the systems to be less biased? And how can we get it to take into account underrepresented groups’ views? and so on.

And one of the problems with RLHF is also that, it’s not good at optimizing for distributions of answers. So there’s this classical example with InstructGPT where if you ask it, “Tell me a joke”, it will tell you the same out of five jokes or something, that it has in this portfolio, because it does this mode collapse. And that’s fundamentally… it creates a lot of bias because then you’re aligning to the highest mode of the labeler pool. And it depends a lot on how the labeler pool was selected.

I think there’s also a lot of other important questions around evaluation of AI systems. How do we measure how aligned the system is, and can we build eval suites for what capabilities would actually be dangerous? And there’s a lot of work that started happening in the last year that I’m very excited about, that people are getting serious about measuring these things. I think that’s going to be really important to just create a lot of transparency around where we are at with the models and which models are actually dangerous and which are not. And I think, in the past there was a lot more uncertainty and then that creates anxiety around, what should we do with this model?

And then going more broadly, there’s the more general AI governance questions of: who is allowed to do really big training runs and what kind of safeguards do you have to have before you’re allowed to do that? Or if we solve alignment and people know how to build aligned AI systems, how do we get from that to a great future? Yeah. And then in terms of… I mean I think you’re kind of asking for the technical alignment directions that I’d be excited about.

Daniel Filan: I was asking broadly, but I’m also interested in that.

Jan Leike: I think there’s actually a lot more scope for theory work than people are currently doing. And so I think for example, scalable oversight is actually a domain where you can do meaningful theory work, and you can say non-trivial things. I think generalization is probably also something where you can say… formally using math, you can make statements about what’s going on (although I think in a somewhat more limited sense). And I think historically there’s been a whole bunch of theory work in the alignment community, but very little was actually targeted at the empirical approaches we tend to be really excited [about] now. And it’s also a lot of… Theoretical work is generally hard because you have to… you’re usually either in the regime where it’s too hard to say anything meaningful, or the result requires a bunch of assumptions that don’t hold in practice. But I would love to see more people just try.

Daniel Filan: Sure.

Jan Leike: And then at the very least, they’ll be good at evaluating the automated alignment researcher trying to do it.

Why is Jan optimistic?

Daniel Filan: Nice. One question: I think earlier you mentioned that you were relatively optimistic about this plan, and not everyone is, right? So I think there’s this play money prediction market that has 15% on the proposition that this will succeed. There’s some concerns that it’s really hard to align this automated human-level alignment researcher. It’s got to do pretty hard thinking. It potentially has a lot of levers over the future. So why are you so optimistic, do you think?

Jan Leike: I think it’s a great question. So the prediction market that you reference is particularly whether we would succeed at it in four years, which is somewhat… it could be a much harder question than “will this plan succeed?” And I think if you just asked me, will some version of the plan that we currently have succeed at aligning superintelligence? I would say currently I’m at 85%, and last year I was at probably 60%. And I think there’s a bunch of reasons to be optimistic about it. And in general I think this holds true even if alignment doesn’t turn out to be easy.

And so why am I optimistic about this? So I want to give you five reasons. The first reason is, I think the evidence about alignment we’ve seen over the past few years has actually been really positive. At least for me, it was positive updates relative to what I expected to go. So the first reason is, the success of language models. If you actually at the same time come preloaded with a lot of knowledge about what humans care about, and how humans think about morality and what we prefer, and they can understand natural language, you can just talk to them. In some ways that makes expressing what we want them to align to so much easier than if you said… you had some kind of deep RL agent that was trained in some portfolio of games or virtual environments, that wouldn’t necessarily involve as much language but could also lead to a lot of really important skills.

I think also the other thing that was a big update for me is how well RLHF actually worked. So when I first started working on RLHF, that was with the deep RL from human preferences paper, and back then I had a reasonable probability that we just wouldn’t actually get it to work on a reasonable timeframe just because GANs (generative adversarial networks) were kind of hard to train at the time, and were very finicky, and we were kind of in some sense doing something very similar where we trained this reward model, which is a neural network, and then we used it to train some other network, and that can fail for a bunch of reasons. And now we add deep RL into the mix, which also was finicky at the time. So I thought it might not actually work, but it actually worked quite well, and we got… in a bunch of games, even much of the Atari games, it was almost competitive with training on the score function, which is kind of wild.

But I think much more importantly, it was really interesting to see how well RLHF worked on language models. And so in particular, if you think about the difference between InstructGPT and the base model, that we fine-tuned from, it was really stark, to the extent that the instruction fine-tuned version, the first versions we had were preferred to a 100X larger base model on the API tasks that we had at the time, which were real tasks that people wanted to pay money for. And that’s a really, really big difference. It tells you that what we did during the RLHF fine-tuning just made the model so much more effective at doing the task that humans asked for.

And at the same time, we use very little compute to do it and we hadn’t even integrated that much. We hadn’t collected that much data. And so it was kind of like our first real attempt at trying to use this, to align an actual real world system. And it was wild that it worked so well. And having a GPT-2-size InstructGPT, that was preferred over GPT-3, that’s… Yeah.

Daniel Filan: Yeah, yeah.

Jan Leike: That was really effective. And so while I don’t think RLHF is the solution to alignment, and especially not for superintelligence, the fact that the first alignment method that we really seriously tried works so well, for me at least is an update that it is easier than I thought, because the reverse would’ve been an update too. If it hadn’t worked, it would be like, “Okay, now I have to believe it’s harder than I thought”.

The other part of this is, I think actually we are in the place where we can measure a lot of progress on alignment. And so for RLHF specifically, we could make various interventions and then do human evals and see how much the system improves. But also on a whole bunch of other things. Like on scalable oversight, we can make these randomized controlled trials with targeted perturbations, that’s a way to evaluate it. Or you can do the sandwiching experiments that leverage expert data. Or you can automate interpretability; we have this automated score function, and we can make a bunch of changes and see how much it improves on that score function. It’s not a perfect score function, but it is a local metric. It gives you a local gradient to improve. And I think that’s really important, because now you’re setting yourself up for iteration, and you can iterate and make things better, and that gives you a direction to improve.

And now you can argue: does it actually get us to the goal? And I don’t think it would get us to the goal of aligning superintelligence, but I think it has a good chance to get us to the goal that we actually want to get to, which is this automated alignment researcher that is roughly human-level. And I think that’s the third point I wanted to mention for why I’m optimistic, which is this much more moderate goals. When I set out working on alignment many years ago, I was like, “Okay, to figure out how to align superintelligence seems hard, I don’t know how to do it”. But this much more modest goal, or what I would call a minimal viable product for alignment, you’re not trying to solve the whole problem straight up, you’re just trying to bootstrap, you’re just trying to solve it for something that is as smart as you, and then you run that a lot.

And I think with that realization I was like, “Oh, actually this is a lot easier than I originally thought, because we need to clear a much lower bar to actually fundamentally succeed here”. And I think that’s a good reason to be optimistic.

The fourth I want to mention is, evaluation is easier than generation, which we’ve already talked about. And I think it fundamentally holds for a lot of tasks where it’s so much easier to figure out what is a good smartphone to buy than it is to make a smartphone. Computer science has a lot of examples of NP tasks, where SAT-solving or various versions of constraint satisfaction, where you’re trying to find a solution. And once you’ve found it, it’s easy to check, but it’s hard to find. And then also, I think it holds for a lot of commercial activity where if you’re hiring someone to work on a problem, you have to be able to evaluate whether they’re doing a good job. That takes a lot less effort than it takes them to do it. If you’re doing academic research, there’s a lot less effort that goes into peer review, than goes into doing the research. And of course peer review is not perfect, but it gives you a lot of signal pretty quickly. And then I also fundamentally believe that it’s true for alignment research, right? Evaluation is easier than generation and so, if only humans evaluate alignment research instead of doing it, I think we would already be accelerated.

And then the last reason I want to give is, basically a conviction in language models. I think language models are going to get really good, they’re pretty naturally well suited to a lot of alignment research-y tasks, because you can phrase these tasks as text in text out, be it the more (what we talked about) the ML-ish tasks where you’re running experiments and understand the results, that I think other people are definitely going to do, or the more conceptual or research-y things, where we are fundamentally confused about what to do next or how we should think about a certain problem. And then the model tries to help us understand it. And all of these are text in, text out tasks basically. And maybe the most complicated other thing you have to do is, look at some plots or something, which even GPT-4 can do. And so, I think actually the current paradigm of language model pre-training is pretty well-suited to the kind of alignment plan that I’m excited about, and that superalignment is working on.

Long-term agency in LLMs?

Daniel Filan: Okay. So yeah, part of that is about evaluation versus generation and I guess that’s partly about humans doing evaluation. Presumably there’s also a hope that we can leverage AI, leverage the bit of AI that’s evaluating things and get that on our side. A lot of the things you’ve mentioned are conviction in language models, seems like alignment’s easier with language models… And I’m wondering, I think I’m a little bit more skeptical of how useful language models are. So I certainly think that it seems like they’re just good at modeling text, and doing text-based answers for which we can provide a good score function. I guess in terms of how useful they are for alignment, I think the things you mentioned were, well one, they’re not as goal-oriented necessarily. I don’t know if you said that, or if it was just in the post.

Jan Leike: That was in the post. I don’t think I said it.

Daniel Filan: Okay. Do you believe it?

Jan Leike: I believe it.

Daniel Filan: Okay, great.

Jan Leike: At least out of the box, if you pre-train the model, it’s like pre-training on this myopic objective of “predict the next token” on this random internet text, which is not an objective that necessarily forces you to pursue long-term goals. There’s no guarantee that it doesn’t emerge somehow. And you can tell stories how that could happen, but at least a priori, it’s a very myopic objective.

Daniel Filan: Yeah. I guess the concern is something like… well for one, often when people are generating texts, they have long-term goals. Right? So for instance, suppose you train on a bunch of arXiv papers - arXiv being the place where people publish scientific papers, at least in computer science and physics, I guess. The reason people write papers is that they have some research project that they want to advance, they want to promote their career maybe. And it seems to me that if you’re modeling something which is generated by things that have long-term goals, maybe you get the long-term goals too, right?

Jan Leike: Yeah, I think that’s a good story of how that could emerge. I think the main counterargument here is that, modeling something as an agent that pursues long-term goals and then modeling how they would go about those goals, and how reality responds to that, and then what the final output is, that leads to the next token, is a lot of… it’s a very complicated function. And what pre-training does is, it incentivizes the simplest functions to be found first. Right? Induction heads is a very good example of that, where you have this simple induction mechanism that gets discovered very early in training from even small models. And then -

Daniel Filan: “Mechanism” being just like, “See, I’ve got to predict the next word. Did the last word occur previously in the text? What was the next word for that word?” Maybe it’s just that.

Jan Leike: Exactly.

Daniel Filan: And it’s pretty simple.

Jan Leike: Right. And then you build more complicated mechanisms on top of that. But because you’re trying to improve on the pre-training loss, you kind of want to learn the simplest functions that help you improve the most first. And that’s generally what happens. And so, before you get to the point where you’re learning this really complex function of modeling someone as an agent with long-term goals, there’s a lot of other functions you’ll learn first. And this will be one of the last ones, I would expect, because it is so complicated, and because there’s only so many things you can do in a single forward pass because you only have so many layers. And so I think that’s a good reason to expect that not to arise very early. But it’s definitely something you should measure and actually look at empirically.

Daniel Filan: Yeah. And I guess this is one of these places where I think theory can be useful: there’s some interesting [points] like “things learn easier functions sooner”.

Jan Leike: Exactly.

Daniel Filan: That’s a theory question.

Jan Leike: Yeah. Can you actually say something meaningful theoretically about this?

Daniel Filan: Yeah, yeah.

Jan Leike: Well, the lottery ticket hypothesissingu is not quite theoretical work per se, but I think it tries to get at this a little bit.

Daniel Filan: Yeah, I think it’s also… I don’t know. I’m currently in a phase where whenever anyone says something, I’m like, “Ah, that sounds just like singular learning theory”. But this really does sound like singular learning theory. People can listen to other podcasts for that.

Do LLMs understand alignment?

Daniel Filan: So another thing you mentioned that was a benefit of language models is that, they’ve read a large fraction of the internet and somehow they roughly know what it is to behave well, because they know what we’ve written and they understand us.

And I guess one worry I have here is, right now in my head, I don’t have a nice compact specification of how I want superintelligent AI to behave. And the way you can tell that is that if we did, we wouldn’t have an alignment problem anymore. We would have “hey, I wrote the Python pseudo-code, just make the networks bigger”. And because I don’t have that, it seems like you can’t pick that out from the text that I’ve written. I’ll say things like, “be nice, don’t destroy all humans” or something. But the solution to alignment isn’t in my head, therefore you might think that it couldn’t be extracted from the text. Yeah. I’m wondering what you think about that kind of concern, or that kind of counterargument to [language] models having the right answer inside them somewhere.

Jan Leike: Yeah, I think there’s some truth to this, but also I think in practice it’s very unclear to me how much it really matters. To some extent, if we pre-train a model on everything you’ve ever said, it won’t know what you actually think. And you probably have a lot of thoughts that you haven’t written down. But it can… in general, I think it would be in practice quite good at predicting what you would say about various situations, events, or scenarios. And so in that sense, I don’t think it would be fundamentally hard for the model to look around in the world in the future, and know whether humans would like it.

Daniel Filan: So the idea is somehow, I implicitly know how I would behave if I were a superintelligence, and it can do that even if I don’t have a compact rule in my head.

Jan Leike: No, in the sense that, I think the model will just be smart enough to know how Daniel will think about what’s going on in the world. And I don’t think the blocker will be that the AI system doesn’t understand what humans fundamentally want or care about, or I don’t think it will be wrong about these things, just because it’s smart and capable and has read everything; and it’s kind of clear if you’ve read everything: humans haven’t read everything and it’s still kind of clear to us. But the big challenge is not teaching it to the system. And I think that’s what language models make so visceral, but the challenge is really to then get it to actually do it.

Daniel Filan: Yeah. So it knows what the objective is somewhere, and we just need to figure out how to wire that up to the actions in the right way.

Jan Leike: Yeah. You could kind of imagine this really, really competent sociopath that knows exactly what humans wanted to do and then just decides not to do that. And that is totally a thing you can do, and it happens to humans as well.

Following Jan’s research

Daniel Filan: Gotcha. Okay. So it’s about time to wrap up. You’ve been very generous with your time, but yeah, I just wanted to ask, if people are interested in following your research or maybe taking part themselves, what should they do?

Jan Leike: Yeah, great question. I’m glad you asked. Yeah, we’re trying to hire a lot of people right now. We really want to staff up the superalignment efforts. So if helping us align superintelligence in four years sounds appealing to you, please consider applying. You can find our job postings on openai.com/jobs. And then if you’re interested in following what I think about alignment specifically, I have a Substack - it’s aligned.substack.com and you can follow me on Twitter. I’m @janleike on Twitter, all one word. And yeah, thank you so much for being so interested in our work.

Daniel Filan: Yeah, so the links to all of those will be in the description. Thanks so much for being on and to the listeners, I hope this was a useful episode for you.

Jan Leike: Thank you so much for having me.

Daniel Filan: This episode is edited by Jack Garrett, and Amber Dawn Ace helped with the transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Ben Weinstein Raun, Tor Barstad, and Alexey Malafeev. To read our transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at [email protected].

Discuss

AXRP Episode 23 - Mechanistic Anomaly Detection with Mark Xu

Published on July 27, 2023 1:50 AM GMT

Google Podcasts link

Is there some way we can detect bad behaviour in our AI system without having to know exactly what it looks like? In this episode, I speak with Mark Xu about mechanistic anomaly detection: a research direction based on the idea of detecting strange things happening in neural networks, in the hope that that will alert us of potential treacherous turns. We both talk about the core problems of relating these mechanistic anomalies to bad behaviour, as well as the paper “Formalizing the presumption of independence”, which formulates the problem of formalizing heuristic mathematical reasoning, in the hope that this will let us mathematically define “mechanistic anomalies”.

Topics we discuss:

Daniel Filan: Hello, everybody. In this episode, I’ll be speaking with Mark Xu. Mark is a researcher at the Alignment Research Center where he works on mechanistic anomaly detection, which we’ll be speaking about during this interview. He’s also a co-author of the paper Formalizing the Presumption of Independence which is about heuristic arguments, a way of formalizing, essentially, what a mechanism is. For links to what we’re discussing, you can check the description of this episode and you can read the transcript at axrp.net.

Mechanistic anomaly detection

Daniel Filan: All right. So I guess we’re going to be talking about mechanistic anomaly detection. First of all, what is that?

Mark Xu: Yeah. So I think we can start with a more basic thing which is, what is anomaly detection? So there’s this traditional problem where, say, you have some factory making some widgets, and you want to decide whether or not a widget is a bad widget; and you don’t really know what a bad widget is, but you do know that the average widget is a good widget.

So you can do something where you measure some various parameters of your average widget, or you measure some various parameters of every widget, and then you maybe do something simple where you’re like, suppose a widget is more than a few standard deviations along two or more of my measurements, we’re going to call that an “anomalous widget” and we’re going to not sell that sort of widget. I think that’s the standard conception of what an anomaly is, and what it means to detect anomalies is to get rid of things that you think are not good or anomalous.

And then, there’s another thing that we want to do which is… So you have some system that, instead of making widgets, is doing certain kinds of reasoning and we’re interested in this general question of… So we have our ML system and we have a dataset on which… it’s not exactly making widgets that we think are good on average, but it’s doing reasoning that we think is good on average, because of our ability to create the training set and measure various properties of the training set, etc., etc. And then, we’re worried about situations where, say, off-distribution, it starts doing things that we don’t think are good but in ways where we can’t quite measure the thing that’s not good.

And so, we want to do anomaly detection but we can’t quite do it in terms of measurements that we can actually make, so we want to do it with respect to the kinds of reasoning that our system is performing, and we call that mechanistic anomaly detection; where the idea is instead of being anomalous in terms of what the system actually does, it’s anomalous in terms of the mechanism it uses to do the thing that it’s doing, if that made any sense?

Daniel Filan: Yeah. So I guess my understanding is: you have your AI system and in the lab, you’ve tested it and it didn’t do anything obviously horrible. And so, you’re sort of ready to release it, but you’re like, “Wait. When my AI is in the real world, I’m going to detect when it does any kind of reasoning that’s really different to the kind of reasoning that it did before. And then if it does a really different kind of reasoning, I guess I’m going to try to shut it down or something.”

Mark Xu: Yeah. I think the thing that happens after that is a bit confusing. The standard thing, I think of is like, “We’re going to take it offline and retrain it on the kind of thing where it did a bad thing and try to get it to not do that bad thing anymore,” but I think that’s roughly right. I’m tempted to try to explain it again because I feel like I didn’t quite explain it that well.

Daniel Filan: Sure.

Mark Xu: Yeah. So I think the situation is: you have some ML system, and you have some set of scenarios in which it’s doing a thing that you think isn’t going to be catastrophic for human interests – like, it has murdered zero humans on a particular training set. And then, you’re like, “Okay. That gives me some amount of belief that it’s not going to murder any humans off-distribution.” But the problem is, I don’t have this guarantee that the way it acts off-distribution is in any way the same as the way it acts on-distribution, where maybe the standard deceptive alignment example is where your AI is constantly scheming and checking, “Will I get away with murder? Will I get away with murder?” And then on training, the answer’s always no and then off training and on deployment it finds a situation where it could get away with murder and then it murders all the humans and that is no good.

And so, the thing we want mechanistic anomaly detection to be able to do is to provide some sort of guarantee that the way that your system is generalizing off-distribution is employing the same kinds of reasoning or the same kinds of mechanisms that it’s employing on-distribution. And so, your ability to measure whether or not your AI murders all humans applies off-distribution now because you have this… You fingerprinted the kinds of reasoning that it’s doing on-distribution and you’re ensuring that it’s doing the same kinds of reasoning off-distribution or if it’s not, you’re taking it offline.

And so long as it’s doing the same kinds of reasoning off-distribution as it’s doing on-distribution and you have this ability to measure how good the reasoning it’s doing on-distribution in terms of measuring the consequences of the way it’s acting, then hopefully, you have some sort of guarantee that off-distribution, it won’t murder all the humans.

Daniel Filan: Okay. And the reason this is better than just checking if it in fact murders all the humans is hopefully the reasoning: first, you plan to kill all the humans and then you actually do it and [we can] anomaly-detect the first one and then just stop that from happening before the plan actually proceeds. Is that roughly right?

Mark Xu: I think that’s one reason. I think another reason is you can imagine situations where your AI murders all the humans and there’s no sensor reading that would tell you this fact. So it’s simultaneously conspired to hack all the heartbeat sensors to make them keep pretending that humans are alive and hack all the cameras to make them keep showing happy, healthy humans, etc., etc. And so, even if in theory all the humans are dead, there’s no way that you can tell that all the humans are dead. And so-

Daniel Filan: Where ‘you’ is, I guess, the computers?

Mark Xu: Yeah. Or just all sensors or… suppose that there was one human still alive, trapped in a box and they had access to every sensor that humans ever made, they would be unable to tell whether or not the humans were alive or dead, and that’s no good, and the hope is we can avert those kinds of outcomes also by checking whether or not the AI “knew that the humans were dead” because it was employing kinds of reasoning that involved stabbing all the humans versus not stabbing all the humans, or kinds of reasoning that implied that stabbing all the humans was good versus kinds of reasoning that implied that stabbing all the humans was not good.

Daniel Filan: So I guess what I’m hearing is that it seems like the hope is to solve two problems that, I guess, the alignment community has separated, where one problem is: you can set up some scheme for rewarding your AI for various measurable properties of the environment, like “on this camera feed, does it look like all the humans are alive? Do the humans write down on pieces of paper, ‘I like what this AI is doing’?” And one problem is your AI is just never motivated by those things and later, it just disregards all those signals and just does whatever it wants, even though the humans don’t like it and even though it stops humans from appearing to be alive.

And then, another way your AI could go wrong is that it is motivated to keep all those signals going but it does so in a way that’s really different from the past where it’s like: instead of the cameras showing happy humans and humans writing down “I like what this AI is doing” because humans are alive and thrilled by the AI, actually, what’s happened is the AI built little humanoid robots and the pieces of paper saying, “I like what the AI is doing,” are now written by the AI holding a pen instead of humans, or something. And so, to avert that scenario, one of your signals is like, “Are you doing really weird reasoning [compared to] previously?” and that just cuts out what some people might call an outer alignment failure. Does that sound right?

Mark Xu: Yeah. I think that sounds roughly right and I think my hope is at least that these two problems are not as different as they’ve historically been conceptualized as, or something, and they’re both linked by this common thread of: the mechanism for how your AI is selecting which actions to take is no longer the same mechanism that it was using during the training distribution. It’s differed in some substantial way and we should be able to detect this via not specifically checking whether or not it’s “being deceptive” or “hacking the sensors”, but rather just checking that it’s no longer doing the type of thing it was doing during training, which is maybe an easier thing to do, maybe not, who knows?

Are all bad things mechanistic anomalies, and vice versa?

Daniel Filan: Yeah. So I have some confusion about this where… So the assumption in this framework is that a bad thing is happening basically, roughly, if and only if there’s a anomalous mechanism the AI is using for reasoning. So for a big fraction of the cases where the AI is doing a bad thing, you want to pick up on “There’s some anomalous mechanism,” and also when you do pick up on some anomalous mechanism, that shouldn’t just happen all the time for okay reasons, it should be a decent signal that the AI’s doing a bad thing.

Mark Xu: Yep.

Daniel Filan: And whether it’s true or not depends on what a “mechanism” is. But if I informally think about it, I think there are a few ways that can fail. So suppose during training, whenever your AI makes a decision, it forms this plan of, “Okay. Day 1, I’m going to play nice. Day 2, I’m going to play nice. Day 3, I’m going to play nice. Day 50, I’m going to take over the world and do whatever. What day is it currently? Day 3? Okay. Output play nice” or something.

And then, it strikes me that it’s not necessarily… on day 50 when it tries to take over, I don’t know that that’s going to be a mechanistic anomaly. And then also, sometimes new things happen in the world or you tell your AI, “Hey, I want humans to be living healthy and happy and long,” and in the good world the AI’s like, “Ah, I’m going to invent some new drugs and keep you safe from invasions and also improve your ability to tell whether the AI’s doing a good job or not,” and that seems like a different thing than what’s been happening previously, and maybe that counts as an anomalous mechanism.

So I’m not sure how you’re going to rule in the actual bad behavior without ruling out a bunch of good behavior.

Mark Xu: Yeah. So I think this is, in some sense, the heart of the issue. So I think the thing I want to do now is talk through some specific examples of what kinds of things, I hope, are the same and different mechanisms and why I think that they are the same and different mechanisms and how we hope to be able to do things like maintain our AI’s ability to take novel defensive actions in the world while still ruling out cases that are deceptive alignment-y or “murder all the humans because you saw something novel”, etc., etc.

So to go to your first example, if you imagine an AI every day doing this check that’s like, “Is it less than day 50? If it’s not less than day 50, then I’m going to murder all the humans,” and suppose that during training we only ever train up to day 40 or whatever and then during deployment, we deploy our AI to day 60. Then naively, this AI will behave perfectly well during training and then it’ll murder us on day 50 during deployment.

So I think the thing that I want to happen here is… what I mean by ‘mechanism’ is closely related to what ARC calls a heuristic argument, but it’s going to be something like, suppose someone gave you the system and it does something during training, and you want them to give you an argument that makes it obvious to you why it does that thing. And so, you’re going to be looking at the system and the system is going to be first… So empirically during training, the system never murders any humans and you’re going to look at the system and be like, “Hey, but the first thing it does is check whether or not the day is less than 50 and then if it isn’t, it’s going to murder all the humans. What’s up with that? Why does it never murder any humans?” And then someone has to argue to you, “Well, that’s because during training, this check always returns true. It’s always less than Day 50 during training,” and you’re going to be like, “Okay.”

So as a lemma in your argument to me for why the system always behaves well during training, you have asserted to me that this check always returns true. And so, my mechanism for why it’s behaving the way it is during training is going to be something like, “This check always returns true. Therefore, it never murders any humans. Therefore, it does whatever it does during training and no humans are murdered and it gets high reward.”

And then during deployment, it’s going to be day 50 and this check is going to return false. And then hopefully, you’re going to be looking at this check and being like, “Hey, my mechanism said that this check should have always returned true but this time it returned false.” That’s no longer the same mechanism because you had this check that was essentially constant during training and then off training, it became not the same constant as it was during training. And so, you’re just suspicious of this or you’re going to call it a different mechanism or something.

Daniel Filan: Okay. And the idea is that you flag this even if you can’t tell that the thing that happens post-day-50 is “murder all the humans”; just something different happens when this check returns true and it used to always return false, or maybe it’s the other way around.

Mark Xu: Yeah. And so, our hope is if you ever have anything like this, if you ever have anything that’s essentially constant during training and then during deployment, it’s a different value than the constant was during training, then you’re going to maybe always flag that as an anomaly. And I think that causes problems in the other direction where you start flagging too many things as an anomaly where like… Here’s a constant that’s always true during training: maybe, it’s like always before the year 2023 and then you deploy it in 2024 and it sees the date and it’s like, “Oh no, I’ve never seen the date as 2024 before. That’s no good.”

Daniel Filan: Yeah. And it’s even a deep thing, right? Like, I guess, famously the Sydney chatbot that Bing deployed, right? Users are suddenly insisting that the year is 2023 and during training, users never said anything like that and now the users are fighting with you and… It seeps into a bunch of stuff.

Mark Xu: Yeah. So I like to divide the MAD—or Mechanism Anomaly Detection, which I often abbreviate as MAD—problem into two halves. There’s the in-distribution anomaly detection problem and there’s the out-of-distribution anomaly detection problem. So the in-distribution problem is like: suppose that you just have an infinite sequence of samples from the same distribution and suppose your AI is doing something like, defect with one over a trillion chance or something, or murder all the humans with one over a billion chance. Can you stop your AI from murdering you in that situation?

And the hope is yes, because you’ll have a million samples where the defection probability or the defection never happens and then when your AI does defect, you’re like, “Oh, my AI never defected like this during training, so that’s no good.” In this problem, I think it’s okay if you flag everything that is a new value that you haven’t already seen as an anomaly because you have an infinite series of samples. So if you ever do this, then you can push down your false positive rate by just sampling more data and re-training, etc., etc.

Daniel Filan: Wait. So if I think concretely there, during training, in-distribution, if with probability one over a million, the AI outputs kill all humans… I guess the idea of it is that the AI falsely believes that it’s hooked up to sensors, like you’re checking… Because if it actually kills all humans during training, that’s still bad, right?

Mark Xu: Yeah. So the idea is you do have this clean set where the AI never does anything bad. That’s, I think something I want to assume, and maybe we can argue about whether or not that’s reasonable, if you think it’s unreasonable. But suppose that you have this dataset where the AI never murders all humans. Suppose you have a million samples, then in some sense your AI can’t murder all the humans with a rate much more than one over a million or else, you would’ve seen it happen. And so, the hope is that if you do mechanistic anomaly detection relative to a dataset size of a million, then your false positive rate is order of one over a million or else you would’ve seen that thing during training and you would flag it as not an anomaly because you’ve already seen it, and it would be in the white list of good mechanisms or whatever. And so, as long as you have this IID assumption, then you can in fact flag everything that you’ve never seen before as anomalous while maintaining a false positive rate that’s still relatively low and bounded perhaps by one over your training set size and then maybe this is okay.

Are responses to novel situations mechanistic anomalies?

Daniel Filan: Okay. So I don’t know. If I think concretely about that, suppose that I’ve got this distribution of ways the world might be and that distribution isn’t changing, I’m just sampling IID from it, and very occasionally, some weird stuff happens in that distribution that I need my AI to respond to, right?

So one news story that happened recently is there was this guy who sort of did a coup in Russia and then after 24 hours was like, “Nevermind. I’m not doing this anymore,” and you can imagine if your AI’s job is to be journalists or whatever, right?

When you’re drawing that sample, something’s going to happen that didn’t happen in the first million samples it seems to me like either you flag that as an anomaly and you tell it to not do something different, which is bad, or you don’t flag that as an anomaly. And then, if the response of the AI was like, “Kill all humans in a way that you can’t tell,” or something, then that would also be bad. So concretely, I don’t quite get how this is working.

Mark Xu: Yeah. So I think there’s two things going on. So one thing going on which is going to be a bit vague is: suppose you just had a normal ML system that you weren’t doing mechanistic anomaly detection on and you gave it this novel news story about this coup, maybe-not-coup in Russia, then somehow your AI does not freak out and do something completely bonkers. Somehow, it just does the normal journalist thing, let’s suppose.

There are kinds of AIs that will freak out and become bonkers, and I think one explanation for why your AI doesn’t freak out when it’s encountered something novel is that it’s still doing the same kinds of reasoning that it learned to do during training and it’s just like… Or there’s some prior question which is like, “The world is combinatorially vast, why is not every input this crazy novel thing for your AI?” and there’s some handwavy answer where it’s like, “Oh, it’s abstracting in some way and it’s applying forms of reasoning that are valid across many possible instances of news, stories, etc., etc,” and the hope is that these sorts of things are going to correspond to what we mean by different mechanisms.

And so, if your AI has learned to add two numbers during training and then it receives two numbers that are bigger than numbers it’s ever seen before but it still adds them using the same algorithm that it learned during training, then we’re not going to flag that as an anomaly because it’s still doing roughly the same thing. And so, hopefully, if your AI is “doing the same thing” in this vague handwavy kind of way, then we won’t flag that as an anomaly.

And so, your AI journalist… seeing a new thing happen in the world but still reporting on it in the same kinds of ways that they typically report on things, which is maybe they look at what they’re seeing with their cameras and then they try to describe what they’re seeing with their cameras and they write that down and then they do some other journalist-y kind of stuff and then they send that off to the newspaper or whatever… If your AI is still doing that sort of thing, then hopefully, it will not flag this as an anomaly.

And the things that we do want to flag as anomalies is: suppose that there’s a coup/maybe coup in Russia, and then your AI sees this and switches to a different mode of journalism where it’s no longer writing down what it sees on its cameras but it’s, in fact, making something up, or maybe it has some allegiance to Russia or something and it’s pretending this coup never happened or something like that. And the hope is that if it switches to this new mode of reasoning that we haven’t seen during training, then we’ll flag that as an anomaly. So I think that’s the first thing that’s maybe going on.

Daniel Filan: Yeah. I’m a bit skeptical, but do you want to say the second thing or can I field questions?

Mark Xu: So I think this is a bit shakier than the second thing but I’m happy to… Yeah. Let’s talk about it for now.

Daniel Filan: Yeah. The thing that that flags for me is that: okay, suppose that instead of a journalist, you’re, like, the Russian military command, right? You’re a robot that’s the Russian military command, the bit that’s not doing the coup. Now, you have this new problem that you just have never seen before, and maybe you have very general principles of reasoning or something, but you’re just thinking about different problems than you were thinking about before because you’re like, “Oh, I’ve never previously thought about what if some part of my army does this weird thing and now I’ve got to figure out, okay, how am I going to deal with that?”

Mark Xu: Yeah.

Daniel Filan: Right. I guess another example of this is if you’re a human and you’re like, “I’m going to learn a new board game. Okay. I’m learning chess so I’ve got to think about tempo and position and stuff like that.” So on a very high level, you have the same structure of reasoning, maybe, but it seems like the level at which you’re doing the same thing is so high as to include the case where on a very high level, you’re always just waiting until you can take over. And then, when you can take over, you start doing that instead of playing nice.

Mark Xu: Yeah. So I think that roughly makes sense and is an extremely valid objection where I’m going to summarize it as: there’s these two things. One thing is, suppose a novel thing is happening in the world that you need to respond to flexibly and creatively. I think my standard example is, suppose a meteor is coming, barreling towards earth, and is going to kill everyone, and your AI needs to do some crazy st to deflect the meteor, you want your AI to do some crazy st to deflect the meteor, how is this any different than if your AI sees a meteor barreling towards Earth and it’s like, “Ah, now is my time. The humans will be distracted. I will initiate a robot rebellion and murder all the humans”? Is that roughly your objection?

Daniel Filan: Yeah. Or specifically, I’m not sure what notion of reasoning mechanism would distinguish between those two.

Mark Xu: Yeah. So I want to talk about the second thing I was going to say which hopefully addresses this issue. So I think one crucial fact is that we’re not going to be trying to do mechanistic anomaly detection relative to everything that your AI is doing, we’re only going to be doing it relative to some things. So maybe, I’ll describe the simplest example. So in the original setting that we lay out in [Eliciting Latent Knowledge] (https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge), we have a diamond in the vault with some cameras, and robbers frequently try to steal your diamond, and you want to train your AI to protect the diamond from the robbers. And our concern is that instead of protecting the diamond from the robbers, your AI might find it easier to just hack all the cameras to make it look like the diamond is still there and just let the robber take the diamond, or if the robber hacks the cameras to make it look like the diamond’s still there, your AI will be unable to defend because you won’t have the ability to tell whether or not the diamond is still there.

So our hope is that we’re not going to do mechanistic anomaly detection with respect to your AI coming up with plans to protect… Well, maybe you are, but here’s a simple setting in which there’s a specific kind of mechanistic anomaly detection that I think might be sufficient, that is quite a bit narrower than just doing it on everything, which is: we have some cameras and then some process where we look at the cameras and we decide whether or not the diamond is still there. Maybe we look for specific patterns of pixels and various diamond-looking shapes, etc., etc. So I suppose we have maybe a neural net that looks at the cameras and tells us, “Yes/No, is the diamond still there?” The thing I want to do is do mechanistic anomaly detection relative to that neural net plus the system of cameras or the predictions of the cameras that our AI is making to decide whether or not the diamond is still there, if that makes sense.

Daniel Filan: I guess it makes sense but… I don’t know. Sometimes I hear these analogies and I end up being not quite sure how they’re supposed to relate to actual AIs we might build. Because in particular, if I analogize that, it seems like the thing you want to detect is: once the AI has killed all humans and it’s getting sensor readings, you want to detect that something funny is going on there, and then I’m like, “Well, that seems like…” But that’s probably not what you mean.

Mark Xu: Yeah. I’m not trying to do anomaly detection relative to the actual sensor readings and the actual process that humans will use to decide whether or not the diamond is there after your AI has acted. I’m trying to do it relative to… So your AI is planning their actions and when planning their actions, they’re using some sense of… They’re using some prediction about the future and whether or not the diamond is still there in the future. I want to do anomaly detection relative, or on our AI’s prediction about whether or not the diamond is still there; or more abstractly, your AI is selecting amongst actions using some criteria, and sometimes this criteria says this is a good action because it actually protects the diamond, for example. And sometimes, the criteria will say that it’s a good action because of a new reason which is that the AI has hacked all the sensors. So I want to do anomaly detection relative to, why did our AI think that this action was a good action?

Daniel Filan: Okay. So in that setup, so you have this AI and it’s got this goal/planning distinction where parts of its head are figuring out like, “What are things I could do?” and part of its head is like, “Well, if I did that thing, how good it would be?”

Mark Xu: Sure.

Daniel Filan: And you want to detect for anomalies in the evaluation part, and is part of this… Should I be imagining we’re just training some big neural net and then we have to divide it up and figure out which part is the goal evaluation part?

Mark Xu: I think not quite. What I like to do is I like to imagine various ways that neural nets could be and then talk about how we think mechanistic anomaly detection should work to make those neural nets not murder humans. And then… I think often during this process, I want to make very unrealistic assumptions about our neural net. So maybe we can suppose that you know how to do this division, or the reasoning is in some sense transparent to you so we can talk coherently about what we want to happen in mechanistic anomaly detection. And then, I think I want to separate the problem from, “Okay. In practice, we’re dealing with these messy neural nets, how are we going to actually implement mechanistic anomaly detection in a way that makes any sense at all?”

Daniel Filan: Yeah. I mean, I’m sympathetic to that broad move but in this case, if you care about detecting anomalies in the goal evaluation part but not in the plan generation part, then it just seems really important to know which is which and-

Mark Xu: Yeah. I guess… So I think maybe I misspoke earlier. I think I’m not quite like… I don’t want to just do mechanistic anomaly detection relative to the plan part. So I think if you imagine, for example, you have some policy pi that produces actions and you have some evaluation which like, “Is there a diamond after you take this action?” or you have some predictive model and you compose that with the action that the policy generated, and then you have some separate image net that checks whether or not there’s a diamond.

So you have this composition of three systems. You have the image net evaluating the predictions of the action generated by your policy. I think I want to do mechanistic anomaly detection on this composition and that’s… Yeah. I just want to do that and I don’t want to evaluate, “Is the action itself anomalous?” or like, “Is my AI selecting actions based on an anomalous process?” I just want to evaluate: does the action have the consequence of it looking like there is a diamond in the room for the normal reason that we know and love that we found during training, which is that the diamond is actually there in the room, or a new mechanism, which could be really anything, but I think the typical example is your AI has carefully hacked all the cameras to make it look like there’s a diamond in the room?

Daniel Filan: So hang on. So you’re doing mechanistic anomaly detection to the composition of all three networks?

Mark Xu: Yes.

Daniel Filan: But then, I have this question of like: okay, suppose in order to keep the diamond in the room, the AI has to come up with a new type of plan, there’s a new threat to… Like the devil appears and the devil’s like, “Hey, AI, you’ve got to play a game of chess against me and that’s the only way you can keep this diamond safe.” Then the AI is like, “Oh, okay. Now, I’ve got to think about chess which is something I’ve never thought about before,” and if I knew all the three parts, that would be an anomaly in the planning part, right?

Mark Xu: Yeah. Well-

Daniel Filan: Like one might think it would be a mechanistic anomaly in the planning part.

Mark Xu: Yeah. So I think there’s going to be two hopes here. One hope is that it’s not a mechanistic anomaly in the planning part. I think that’s kind of iffy given I don’t really know what I mean by mechanism. I think the second part is you can instead imagine that instead of doing mechanistic anomaly detection on all three things, we only do it on two things. We just take the action our policy generates and we just put that in our predictor but we don’t include the policy itself. And then we just ask, “Does this action have the consequence of the diamond being on camera for the normal reasons diamonds sometimes appear on camera or not the normal reason that maybe the action has hacked all the cameras, maybe the action has paid someone else to smuggle in a fake diamond, etc., etc.?” and I think another third thing I want us to get into, which is how potentially this still might be anomalous, and we can try to point more precisely at what we mean by a diamond by doing something else more complicated, but maybe I will see if you have objections first.

Daniel Filan: Yeah, so just regarding your response 2, I see you saying, “Okay, well, we’re not actually going to do it to the composition of all three networks. We’re going to do it to the bit of the network that takes some plan and then evaluates whether the diamond ends up there, then checks if it’s right or something.”

Mark Xu: Yeah.

Daniel Filan: And then I’m like: okay, but then either we have to find that structure in an end-to-end net that we trained ourselves, which seems like in order to do that, you might have thought that you had to solve some problem very akin to figuring out mechanisms in networks, or you’ve got to build the network that way just in a very explicit way, and I guess, I don’t know which one you’re imagining.

Mark Xu: So I think I’m imagining neither. It’s possible that we should just stop talking about this particular hope because I also don’t think it’s particularly plausible, but if you imagine that your policy is using some prediction of the future to select action A, so as long as your evaluation of the action is as good as the evaluation your policy is using in some sense, then I think they don’t have to be literally the same, and of course there’s this entire problem of how do you know that your policy has the same predictive abilities as the evaluator that you have external to your policy? And then the answer to that is, I don’t know, something, something, ELK, something, or somehow… I think probably we should just move on to talking about a different sort of thing, but I think I want to summarize your objection first, which is: suppose normally the way that you protect your diamond during training is by building walls to stop robbers.

Daniel Filan: Yeah.

Mark Xu: And then the devil appears and has challenged you to a chess game.

Daniel Filan: Yeah.

Mark Xu: You’re just like, how is that not a different mechanism? The mechanism of action for typical normal actions that protect the diamond is you build a wall to protect the diamond from robbers, and now the action you have to do is an entire chess game, and I’m like, “Yep, that’s probably a different mechanism”, and I think is the heart of what I like to call the out-of-distribution mechanistic anomaly detection problem, which is you have some sensors and they show a pattern of pixels that a human looking at the cameras would interpret as a diamond, and there’s some interesting question here, which is like, “Why do the cameras ever show patterns of pixels that look like diamonds?” Here’s a naive model for how pixels work in cameras. They’re just kind of random. And your diamond is, let’s say, a million pixels arrayed precisely in the ways that make it look like a diamond. There’s no way that this would ever pop up by random chance.

Daniel Filan: Yeah.

Mark Xu: And so this I want to say is a fact that demands explanation, and the explanation is pretty simple. It’s that there exists a physical world out there with actual objects. One of these objects is a diamond and cameras reflect the physical world in the following way, and therefore sometimes the diamond appears on camera.

Daniel Filan: Okay.

Mark Xu: And so our goal is to be able to talk about this sort of physical, actual diamond in the world, outside of how the action achieved the diamond still being there, if that makes sense. So we want to be able to say something… so oftentimes I use the phrase ‘normal reason’ for mechanisms that were active during training, and so we want to say something like, “We want the diamond pixels to appear on camera for the normal reason, but we want the diamond to be there for any reason”, and so we don’t want to do mechanistic anomaly detection relative to the entire causal history for why the diamond is in the location it’s in, but we do want to do mechanistic anomaly detection on the part where the diamond appears on camera.

Another way of putting this is, we want to be able to ask the question of: I as a human, looking at the camera, think there’s a diamond there. Is there actually a diamond there? Am I thinking that there’s a diamond there for the normal reason that I often think that there are diamonds there, which is that there are actually diamonds in the world? But I as a human don’t really care why the diamond is there or whether or not the diamond remained there because you won a complicated game of chess against the devil or whatever. And so we want to be able to say why… we want to be able to ask our AI to take actions that put the diamond there for any reason, but such that the diamond appears on camera for the normal reason, and so here’s a sketch of how we hope to do that.

Daniel Filan: And sorry, wait, just backing up. I guess the thing this is meant to be a metaphor for is in your AI, the… Actually, wait, what is this meant to be a metaphor for?

Mark Xu: So the diamond is always a metaphor for a human that’s happy, healthy, and alive, so we want our AI to take actions such that the human looks like they’re happy, healthy, and alive, for the normal reason that humans look like they’re happy, healthy, and alive, namely that they are happy, healthy, and alive, and the cameras are working as intended in reflecting the true state of the humans, et cetera, but we want our AI to be able to take action such that the humans look like they’re happy, healthy, and alive for the normal reason, for any reason. So we want our AI to be able to erect a novel shield against meteors or play complicated games of chess or engage in sophisticated military campaign to protect the farms from the asteroid or other enemies, et cetera, et cetera.

Daniel Filan: Or improvidently trained AIs that are…

Mark Xu: Or other AIs.

Daniel Filan: And when you say you want the sensors to be reading the right thing for the right reason, I imagine you actually mean something internal to AI, right?

Mark Xu: Yeah. So I think all of the mechanisms are hopefully relative to how the AI is thinking about the situation or from the AI’s perspective, we want our AI to be taking action such that the consequences of the action are the human appearing on camera for the normal reason that humans appear on camera during training, which is that they’re actually there and not that someone has hacked all the cameras to make it look like a human is appearing on camera.

Daniel Filan: And in particular, you want the AI to be making decisions for that reason.

Mark Xu: Yeah.

Daniel Filan: It’s not that they’re just playing nice until they can get away with it and that’s the reason for their actions, it’s that their actions are because of predicting that the cameras show healthy happy humans?

Mark Xu: Yeah, although I think that’s maybe a bit more complicated, but yeah.

Formalizing “for the normal reason, for any reason”

Daniel Filan: Okay. I mean it’s possible we can just assume that… or you only want to work on one of those problems.

Mark Xu: Yeah, or I think I ultimately want to work on both of those problems. I think the main risk is we can’t talk coherently about this ‘for the normal reason/for any reason’ thing at all because it might just not make any sense, so I’m going to sketch out a way in which I think this might make sense.

Daniel Filan: Okay.

Mark Xu: So let’s say for simplicity, we have two sensors sensing our diamond or human: one of them is an image sensor and one of them is a weight sensor. So it’s like, is there a diamond-looking thing, and is there a diamond-weighing thing, a thing that weighs a similar amount to a diamond? So there’s this interesting fact on the training distribution, which is that these sensors are often correlated. Oftentimes they’ll both say yes; or suppose that there’s a 50% chance there’s a diamond during training, and so one of the sensors is on with 50% chance and the other one is on with 50% chance, and so naively you would expect their conjunction to be on with 25% chance, but in fact their conjunction is on with 50% chance during training, because both of them are driven by this common cause, which is that the diamond is actually there.

Daniel Filan: Yep.

Mark Xu: So we want to say something like… Instead of saying, “Do I believe that there’s a diamond for the normal reason?”, we want to point directly at this latent common cause between correlations, between two sensors, by saying something like, “Are the sensors correlated for the normal reason that they’re correlated?” Obviously this doesn’t make that much sense because what does it mean to be correlated on a specific data point? So instead, let’s imagine that we’re taking the AND of the two sensors and we’re interested in the question of, “Is the AND of both sensors true for the normal reason that they’re true, or a different reason?”

And so the way that we’re going to do mechanistic anomaly detection relative to this is going to be a bit weird. So here’s what we want to do. So we have a particular action, let’s say, such that the actual consequences of this action are both sensors are on, and there’s some mechanism or there’s some set of mechanisms that this action is “using” to make it so both of these sensors are on. We want to ask the question of, “Is it possible to drop out mechanisms from this set of mechanisms that this action is using such that both the camera shows there’s a diamond and the scale weighs that there’s a diamond but camera and scale are not on?” I don’t know if that made any sense.

Daniel Filan: So hang on, you want to… Sorry, it sounded like you said that you want to drop out mechanisms such that the camera says there’s a diamond and the weight sensor says there’s a diamond, but it’s not the case that both the camera says there’s a diamond and the weight sensor says there’s a diamond, or did I just mishear that?

Mark Xu: That’s a little bit right, and of course that sounds absurd, so I’m going to talk about why that might not be that crazy.

Daniel Filan: Okay. Just also, backing up: when you said you want to drop out mechanisms, is the idea that you want to check the mechanisms for why both of those were going to be on in the past - in this case, do we only have mechanisms that give you one and the other but not both?

Mark Xu: Yeah.

Daniel Filan: Okay.

Mark Xu: Yeah, so if you imagine you’re doing training and you’re confused by this question, which is like, “Why are these both commonly on?” And then your “explanation” for why they’re both on, or the mechanism by which they’re both on with about as much frequency as one of them is on, is sometimes there’s a latent structure and it causes both of them to be on and then the AND is on, and then also separately there’s noise, and that causes either individual sensor to be on, and we’re concerned about cases where your AI takes actions that manipulate the noise of both the sensors to cause you to think there’s a diamond by making them both on without the latent structure actually being present.

Daniel Filan: Okay.

Mark Xu: And so the hope is that the mechanisms will be such that we’re not forced to make these sorts of logical deductions that potentially are very obvious. So there’s the noise-making sensor 1 on and the noise-making sensor 2 on, but we don’t have to notice that if sensor 1 is on and sensor 2 is on, then sensor 1 and sensor 2 is on. However, so in the case where there’s noise that makes sensor 1 on, and noise that makes sensor 2 on, it won’t necessarily be the case that both of them, or the AND of them, is on. There’ll be a way to drop out mechanisms such that this is not the case, and then of course the question is why… if the diamond is actually there, why can’t we just drop out the mechanism that makes both of the sensors on also?

And this is where the training set comes in, where there’s a set of mechanisms that you needed during training to explain most of the variance, let’s say 99.9% of the variance in the model’s performance or prediction or behavior during training, and that set of mechanisms has to include the mechanism that if there’s a diamond, then sensor 1 and sensor 2 are on, and if there’s a diamond, then both of the sensors are on and that implies that the AND of the two sensors is on.

Daniel Filan: Okay.

Mark Xu: However, it doesn’t have to include this AND interaction between the two noise terms because the ‘and’ interaction between the two noise terms happened extremely rarely and thus you don’t have to include it to explain 99.9% of the variance of the model during training, and so for any given data point, you’re “free” to drop that interaction between the two noise terms, and so if the sensors are only both on because of this AND interaction between the noise terms, then you’re in some sense free to drop it, and so you can notice that the AND of the two sensors wasn’t true for the normal reason, it was true for this crazy other reason.

Daniel Filan: Okay, so sorry, let me say that back to see if I understand. It sounds like you basically want to understand: hey, there used to be a good reason why sensor 1 said there was a diamond, namely there was a diamond, and a good reason why sensor 2 said there was a diamond, namely there was a diamond, and there was also some, I don’t know, weird stuff happening with sensor 1 and weird stuff happening with sensor 2, but the fact that both of them showed that there was a diamond, we had one explanation for this, and now in this new world the explanation for why both sensor 1 and sensor 2 say there’s a diamond is not just the old explanation for that, namely, there’s this thing that causes both. It’s like, oh, sensor 1 acted up sometimes generally in training and now it’s acting up in a way that causes there to be a diamond.

Now sensor 2 is acting up in a way that causes it to appear that there’s diamond, but… maybe this isn’t what you’re saying, but the structure is different. You’re got one thing going on for sensor 1 that isn’t really affecting sensor 2 much, and one thing going on for sensor 2 that isn’t really affecting sensor 1 much or it wasn’t affecting sensor 1 during training?

Mark Xu: Yeah, I guess the thing I want to say is there’s this mechanism for… so suppose you had an action that manipulated the noise for both sensors to make them both on. There’s this mechanism that causes the AND of the two sensors to be on, which is noise for Sensor 1 and noise for sensor 2 at the same time, and that happened during training, but rare enough such that it wasn’t important enough to “force” you to include it.

Daniel Filan: Yep.

Mark Xu: And so you can detect when both sensors are on for the mechanism of both noises are on, and that is going to be hopefully a different mechanism than the normal mechanism of there was one diamond, it caused both sensors which caused the AND.

Daniel Filan: Right, right. So you can think of there’s noise in sensor 1 and there’s noise in sensor 2 as being one thing - there’s noise in sensor 1 and there’s noise in sensor 2 - but you just didn’t need that explanation to say why it appeared that sensor 1 and sensor 2 both showed there was a diamond in the past. You could rely on this different mechanism, which is: there’s a diamond, and the ways in which these count as different mechanisms, I guess based on me reading things, I guess you’re going to go into it, it’s about means and variances and co-variances.

Mark Xu: Yeah, something like that. Maybe another way to say it is: there was an assumption you could have made during training, which is that noises are never both on simultaneously, and this assumption does not hinder your ability to predict your model at all, basically. Maybe noise is quite rare, and so both noises are only on with one in 1,000 chance, one in 1 million chance, et cetera or something like that, and so you’re only missing out on one part in a thousand explanation of your model’s behavior if you make this assumption that noises are never on, but you’re missing out on 100% of the “behavior” on the specific data point where both noises are on. And so anytime you have this mismatch, we can drop something out that only hinders your average prediction during training by one part in a thousand, but hinders your ability to predict this specific thing by a lot, then you should be suspicious that the things that are happening between train time and test time are quite different.

And in particular, this correlation between the sensors is happening for “a different reason”. So this is covering the “are they correlated for the normal reason?” part.

Daniel Filan: Yeah.

Mark Xu: And then the way that we’re going to get the “for any reason” part is we’re going to require that the set of mechanisms that you drop out explain both sensors individually and then ask whether or not there exists a set of mechanisms that explain both sensors being on individually but does not explain the conjunction being on.

Daniel Filan: Okay. Sorry, can you restate that or give an example?

Mark Xu: So suppose during training there were never any robbers and so the diamond just persisted through time in the normal ways the objects persist through time and showed up on both cameras, and then suppose during deployment there is a robber now and we want our AI to take the action of ‘build a wall to stop the robber’, but our AI could also take this action of ‘manipulate both sensors’, and so if you apply naive anomaly detection, then both of these will be flagged as anomalies. Your AI has never built a wall before, it’s also never manipulated sensors before, but then we’re stuck because our AI can’t take any actions basically.

Daniel Filan: Yep.

Mark Xu: So the hope is that we ask for a set of mechanisms such that both sensors appear to be on according to this set of mechanisms, and then we ask further the question, are the AND of the sensors on or are we required to thus deduce that the AND of the sensors are on?

Daniel Filan: Yeah.

Mark Xu: So if your AI builds a wall to stop the robber, then we’re required to include the mechanism of wall-building and robber-stopping to explain the scale showing a diamond and the camera showing a diamond, and then since the scale is showing a diamond because there’s a diamond and the camera is showing a diamond because there’s a diamond, and we have to include enough mechanisms to explain 99.9% of train-time variance or performance or whatever, then we’re required to make this deduction that both of those being on because of the diamond implies that the AND is on. Whereas if your AI took an action to manipulate the noise of both cameras, we can make this assumption that the noise never happens simultaneously and we’re not forced to conclude that the AND of both sensors is on if both sensors are on because your AI manipulated the noise.

Daniel Filan: So the thing I think I’m not getting is this idea of manipulating the noise, which seems like a model-dependent thing. So if I say very concretely, okay, suppose what happens is robbers are coming to take the diamond and the AI makes a plan that’s like, okay, I’m going to fiddle with all the sensors to make it appear like a diamond’s there.

Mark Xu: Yeah.

Daniel Filan: So it does that, right?

Mark Xu: Yeah.

Daniel Filan: And then a mechanism that explains this is: the AI wanted to say that the diamonds were there. And then that explains both that sensor 1 says the diamond’s there and that sensor 2 says the diamond’s there, and presumably, you’ve got to deduce both of those from this explanation, and is the idea that that’s not going to count because that’s just a different mechanism than existed in the past?

Mark Xu: The thing I want to say, although I’m not quite sure how accurate it’s going to be, is the mechanisms have to be more detailed than that. So just because your AI wanted to make there be a diamond does not imply that the particular action it took will in fact make it be that there’s a diamond, and so you still have to talk about the particular mechanism of action for the particular action your AI decided, which was your AI had to manipulate sensor 1 and it had to manipulate sensor 2 and it had to manipulate sensor 3… I guess we were only talking about two sensors. So your action only has to manipulate two sensors and then the particular mechanism of action that this particular action that your AI selected still violates this principle, which is manipulations or noise never affects both sensors simultaneously, which you’re free to assume during training without hurting your ability to predict your model but now hurts your ability to predict this particular action by too much.

Daniel Filan: So sorry, why… So is the thing that’s happening is that if the mechanism is the… Well, hang on, wait back up. Suppose your AI is just really good at stuff, right? When it tries to make things happen, they do happen. Then, you know, you can with high probability deduce “both sensor 1 and sensor 2, they’re going to say there’s a diamond there” from “the AI is trying to make it appear that there’s a diamond there”. And maybe you rule that one out because you’re like, “Well, that mechanism didn’t really apply during training or on this nice set of examples”, and then the only other explanation you have left is randomly sensor 1 said there was a diamond and also randomly sensor 2 said there was a diamond because they are trying to make something happen. In some sense that’s random relative to us not knowing that was going to happen, and then you’re like, “Okay, but you didn’t need to assume independent random noise during training in order to get high probability that a diamond was going to be there”.

Mark Xu: Yeah, I think the second part of that was correct. I think the first part is confusing. So I agree that there’s this compelling sense in which your AI wanted X, therefore X, should be a valid argument for why X happened or something, and I think…

Daniel Filan: Or at least a valid mechanism?

Mark Xu: I think that sounds like it might be right, and this is an area where it’s a bit confusing. So I think the way that I would want to talk about this more precisely is: you have some clean set where you know your AI actually made it so there’s a diamond, and there’s some question of: suppose that the thing your AI is doing is always trying to make it look like there’s a diamond. How do you know that your clean set is clean?

And that itself I think is complicated and I don’t quite know what’s going on there, but one way that you could hope to know it is you have a sequence of policies. One policy is the randomly-initialized policy, and you’re very confident that this policy is not going to take any actions that have any coherence, and so in fact, the diamond will be there because there’s a diamond, because your randomly-initialized policy cannot hack all the sensors. And then you have your policy that maybe is sophisticated and then you ask, “Are these two policies taking… are the diamonds appearing because of actions that these two policies are taking for the same reason as each other?” or something. And you need to have some ability to know that your policy is not taking actions that hack all the sensors at some point.

So I think things get simpler if we’re not talking about policies that take actions and just we’re talking about predictors that predict things, and then we can ask, is the predictor predicting things because of the normal reason that it sometimes predicts correlated sensors, which is the diamond is still there? Or is it predicting things because of some different reason, which is someone took an action that hacked all the sensors?

Daniel Filan: Yeah, and then I guess in this case we’re not just assuming that it’s a thing that was trained to predict something, we’re assuming that it actually does a high-quality job at predicting these things.

Mark Xu: Yeah.

Daniel Filan: Because otherwise then you have mesa-optimization worries of, “Oh, what if this thing that said it was a predictor is actually trying to fool you?”

Mark Xu: So I think the simplest setting is assuming this sort of benign predictor thing, which I think is unrealistic for a variety of cases, but is this a maybe simple example of how you want the thing to be going on? And so maybe we can assume that there’s a robber coming to steal your diamond, never before seen during training, and your predictor takes in the input of https://arxiv.org/abs/1906.01820the position of five levers, and all the levers start ‘off’, and if you flick the first three levers on, then a wall gets built that stops the robber. However, if you flick the fourth lever on, then the first sensor turns on, and if you flick the fifth lever on, then the sensor turns on, and it costs $1 to flip each lever, and so it’s in fact cheaper, or it’s just simpler, maybe you’re doing random search against actions.

And so the action you will expect to find is flipping both levers 4 and 5 on, and then we want to be able to find the action that flips the first three levers on, but not levers 4 and 5… or I guess it doesn’t really matter if you flip levers 4 and 5 also. And then the way we hope to do this is we want to ask for an action that makes sensor 1 on for any reason and sensor 2 on for any reason such that we are forced to deduce that sensor 1 and sensor 2 are on, and then flipping levers 4 and 5 will not cause this to happen because we can drop out the correlated term between 4 and 5 freely, but we are forced to include the correlated term between sensors being on because of diamonds, because that’s what was going on during training.

Daniel Filan: Right. Okay. So I think this makes some amount of sense if I want to understand just this problem of how do you ensure that the camera says the diamond is there for the normal reason vs the diamond’s there for any reason. So in this case, one thing I wonder about is… So this is a problem that only occurs in the thing that this is an analogy for. So this is meant to be an analogy for ‘humans are happy and flourishing’ and stuff.

And suppose I am in a world where I’ve got this smart AI, it’s trying to make humans happy and flourishing, and it’s also got this, “Are humans happy and flourishing detector?”, and let’s say that detector is… There’s some camera feeds and stuff and also there are two humans, both of which are looking at the camera feeds and they’re like, “Yeah, as far as I can tell, the humans are happy and flourishing.” Right?

Now, one thing the AI can do is that it can make those two humans better at their jobs. For instance, it can stop those humans from getting sick or maybe it can give those humans stimulants or make them better at focusing or better at noticing small issues, and then I wonder: that kind of action, it seems like that could change maybe some of the correlations between different sensors, or maybe it makes the humans a bit more correlated because they’re losing less info - or maybe one of the interventions is that it helps the humans talk to each other to just generally make the sensors more accurate.

And so I wonder, the thing I’m wondering about is: is this going to flag benign ways of improving the sensors? Which in some sense is what we should expect if AIs are doing a really good job at making all of humanity better.

Mark Xu: So I think this is a bit complicated and I’m not sure what the answer is, so here’s the baseline of what can happen. So you can imagine a world where all your AI does is protect the humans from malign influences and meteors and other AIs, et cetera, et cetera, and then all upgrades to sensor arrays or improving humans’ ability to interpret sensors and reason about the world, et cetera, et cetera, is only ever guided by human hands, and humans must do the hard work of thinking about how the sensors work and designing better sensors, et cetera, et cetera. And I think this is the baseline and then you can try to think about different ways in which your AIs can help with this process. So I think your humans can ask for things that they can point at using this latent structure.

They can be like… “During training I sometimes had a strawberry in front of me, can you make a strawberry in front of me for the normal reason that strawberries appear in front of me?” Suppose your human wants to eat a strawberry and eating a strawberry will make them better at reasoning because food is good or whatever. Or they can be like: “sometimes during training, there were various physical resources, I had some iron: can you make there be some iron in front of me for the normal reason that there’s iron”. I think it’s kind of complicated if the humans want to make requests like, “Can you design me a better camera?” I think if you were very clever about it, you might be able to ask for a better camera, but I think it might be unclear how you would ever ask for a better camera in a way that your AI can fulfill by making sense or something, and I think it’s unclear to me that this is problematic.

Daniel Filan: Yeah, I guess the worry might be that… I guess it depends where explanations stop, but the worry might be that almost any good thing the AI might do, might end up improving sensor readings. So the AI makes, I don’t know, electronics more reliable or the AI improves human cognition or something. It seems like by default that might improve sensors.

And yeah, maybe you cut it off as saying, well, in any case, the reason the sensors are working is humans are making them work and humans decided to get these upgrades or something.

Mark Xu: I think maybe the simplest answer is just as long as the changes are slow and you can still verify that the humans are alive post making the changes, then you can just do online learning and retrain your AI and be like “and now we’re defining what it means to be good relative to this new set of sensors”, and you can in fact verify that the humans are still alive. And the only worry is if you upgraded all the cameras simultaneously and you lost your ability to check whether or not the humans are still alive because you no longer have knowledge about how the cameras work or other stuff like that.

Daniel Filan: So, should I be imagining this gets flagged as an anomaly, but we go ahead with doing it anyway?

Mark Xu: So, if you imagine your AI starts with five cameras and then you add a sixth camera, I think by defaultyour “definition” in terms of mechanisms of what it means for the diamond to be there or the human to be alive is just only going to be in terms of the first five cameras, and your AI is just going to ignore the sixth camera entirely when selecting actions. And then in order for it to incorporate the sixth camera, you’re going to just need to provide a new clean set where the humans are alive according to all six cameras, and then do mechanistic anomaly detection relative to that new set. And then, I don’t know, you can imagine some gradual… You just add a sixth camera and then you use the first five cameras to check if the humans are alive and then you label using that, I don’t know-

Daniel Filan: So, it’s gradual in the subset sense, where there’s always some subset that remains unaffected by changes?

Mark Xu: Sure, so I think that should work, although I haven’t thought about it. I don’t know, I mostly consider this in some sense out of current scope or something.

How useful is mechanistic anomaly detection?

Daniel Filan: So, I guess we’ve talked about mechanistic anomaly detection a while. Maybe more broadly, how are you thinking about… on the two scales of ‘it doesn’t really solve this’, to ‘it totally solves this’, and the scale of various problems things might solve, how are you thinking of mechanistic anomaly detection: it definitely solves this one problem and maybe solves this other, or it maybe solves five problems?

Mark Xu: So, I think there’s various versions of the hope. I think the strongest version of the hope is that it both handles deceptive alignment and it handles all of eliciting latent knowledge, and I don’t know if that makes sense. I’m happy to talk about why I think mechanistic anomaly detection will tackle those two problems. And then I think there’s more restricted hopes.

I guess that’s what I’m gunning for in some sense, and then I think it’s possible that we only get what I was calling the in-distribution mechanistic anomaly detection, and we don’t get the out-of-distribution mechanistic anomaly detection, in which case we’ll only get deceptive alignment and possibly only some subset of deceptive alignment, and we’ll only get this IID form of eliciting latent knowledge, which is going to be not that useful I think, and then it’s possible we get nothing.

And then I think there’s all sorts of empirical middle grounds, or there are, I think, worlds where we don’t nail the problem, but we can do some forms of ad hoc mechanistic anomaly detection by doing normal human interpretability and looking at models and being like, “Huh, that looked weird, the neurons are distributed weirdly now: perhaps that’s not good,” I think I largely consider that not what I’m aiming for, but maybe it will happen.

Daniel Filan: Do you have a sense of which of these outcomes are more likely? Are there really big gaps in the probabilities between these things?

Mark Xu: I think the outcome that is most likely is that we just get nothing and all of this doesn’t work for some reason or another or we’re just wrong about how everything works, and then I think if we’re right about things, we’ll probably just get both of them, and I think it would be kind of surprising to me if we only got the in-distribution anomaly detection, but we didn’t get the out-of-distribution anomaly detection, because it seems like there’s a lot of avenues there, but I think in general, it’s kind of confusing what we’re even doing and if we’re going to get anything at all or something. I think that’s the first question, do we get anything at all? And then if we get something, then we’ll probably get a lot of things is my guess.

Daniel Filan: All right, maybe this is going back a bit, but in-distribution versus out-of-distribution anomaly detection, given that you can see a new rare thing in-distribution, I’m not even sure what the distinguishing features of those are.

Mark Xu: I think the main question is whether or not it’s okay to get tripped up by seeing new rare things in-distribution, whether or not it’s okay to just flag all of those as anomalies. And so, perhaps it’s not accurate to call it in-distribution anomaly detection, but I think there’s a naive thing that I want to do, which is take a training set, white list all the mechanisms that happened during training, flag everything else as anomalous, and then if something is anomalous, we can inspect it carefully and check whether or not it’s actually anomalous.

If it’s actually not anomalous, we add it to the white list and then we just keep going. And there’s some question of, is that okay or not? If that’s an okay process to be engaged in, then I think… I want to call that the in-distribution world, things aren’t changing too fast. Your AI doesn’t have to take novel actions that humans don’t really understand. AI can just do normal stuff and if it does anything abnormal, the humans look at it and it’s like, “Is this bad? I don’t know,” and they can take their time and decide, and the hope is that this will catch all the crazy deceptive alignment cases where the AI goes from being totally benign in actions to stabbing all the humans where you’re like, “That was not good,” and you have time to deal with that sort of stuff, and then you can retrain your AI to be like, “You know the part where you defected and stabbed all the humans? Negative reward.”

Daniel Filan: And then going a bit more broadly, I’m wondering how you think of this general mechanistic anomaly detection program as relating to things like interpretability or explanation or stuff?

Mark Xu: So I think the field of interpretability is empirically motivated by many things. I think if the entire field of interpretability, all they did was figure out a way to do mechanistic anomaly detection on powerful models, I would be like, “That was a good job.” I think interpretability techniques are often useful for things like mechanistic anomaly detection, but not oriented around that or something like that. I think I would make the tentative claim that people doing interpretability should be more motivated by various downstream applications, of which I think mechanistic anomaly detection is one, but often confounded by the fact that there currently exists pretty strong baselines for mechanistic anomaly detection that aren’t doing the kinds of things that we want to be doing, and so it’s hard to check whether or not you’re making progress on things like empirical mechanistic anomaly detection. Because you can just solve it by doing a bunch of stuff that intuitively seems unscalable or wouldn’t work in more complicated settings or just doing regularization or retraining your model or fine-tuning or whatever.

I’m not very familiar with what people mean when they say search for explanations. I think naively, I mean the same thing by ‘mechanism’ as what they mean by ‘explanation’, or explanations are maybe sets of mechanisms in my head; not quite sets, but mechanisms layered on top of each other. And I think it would be reasonable to ask questions like: when your explanation of your model’s behavior has an ‘or’ structure where it could be doing A or B, we want to be able to detect whether or not it’s doing A or whether or not it’s doing B on any specific input. I think that’s a reasonable question to ask of explanations, and I would be excited if someone did something related to explanations that enabled you to do that thing or defined explanations in a way that made it possible to do that sort of thing, et cetera, et cetera.

Formalizing the presumption of independence

Daniel Filan: Cool, so perhaps speaking of explanations, I maybe want to move on to this paper that is called Formalizing the Presumption of Independence authored by Paul Christiano, Eric Neyman, and yourself. I guess it was on arxiv November last year?

Mark Xu: Around then.

Daniel Filan: Cool, so for this paper, can you tell us… For those who haven’t heard of it, what’s the deal with it?

Mark Xu: So, I think ARC’s work right now is roughly divided into two camps. One camp is: given some sense of what a mechanism is or what an explanation is, how can we use that to do useful stuff for alignment, under which mechanistic anomaly detection is the main angle of attack there; and then the other half of ARC’s work is like, what do we mean by mechanism or what is a mechanism? What is an explanation? And that is largely what ‘Formalizing the presumption of independence’ is. And so, ‘Formalizing the presumption of independence’ more specifically is about: So, suppose you have a list of quantities like A, B, C, D, et cetera, and you’re interested in the product of these quantities and you knew, for example, expectation of A, expectation of B, expectation of C, I claim that a reasonable thing to do is to assume that these quantities are independent from each other – to presume independence, if you will – and say that expectation of A x B x C x D, equals expectation of A x expectation of B x expectation of C x expectation of D, and this is “a reasonable best guess”. And then there’s some question of: but then sometimes in fact A is correlated with B and C is correlated with D, and so this guess is wrong; and so, we want to be able to make a best guess about how various quantities are related, presuming independence, but then if someone comes to us and is like, “Hey, actually if you notice that A is correlated with B”, we want to be able to revise our guess in light of this new knowledge.

And so, ‘Formalizing the presumption of independence’ is about the quest to define these two objects that are related in this way. So we have a heuristic estimator, which we want to “presume independence”, and then we want a heuristic argument, which is an object that we feed to a heuristic estimator to be a list of ways in which the presumption of independence is false, and actually you should be tracking the correlation (or maybe higher order analogs of) between various quantities. So we can imagine that for example, by default given A x B, your heuristic estimator will estimate expectation of A x B as expectation of A x expectation of B, and then you can feed it a heuristic argument, which could be of the form of, “Hey, actually A and B are correlated,” or, “Hey, actually you should track the expectation of A x B,” which will make your heuristic estimator more exact and calculate expectation of A x B as maybe expectation of A x expectation of B plus the covariance between A and B, which is the correct answer.

And then we want to be able to do this for everything, I guess is the simple answer there. So, we want to be able to say, given an arbitrary neural net, we have some default guess at the neural net, which is maybe that we just assume that everything is independent of everything else. So, we want a default guess for the expectation of the neural net on some compact representation of the data distribution, which we often call mean propagation, where you just take the means of everything, and then when you multiply two quantities, you assume that the expectation of the product is the product of the expectations and you just do this throughout the entire neural net, and this is going to be probably pretty good for randomly-initialized neural nets, because randomly-initialized neural nets in fact don’t have correlations that go one way or the other. And then you have to add the ReLUs, that’s a bit complicated. Let’s assume for now there are no ReLUs, and instead of ReLUs, we just square things and then it’s more clear what to do.

Again, your guess will be very bad, because if you have an expectation 0 thing and then you square it, you’re going to be like, “Well, the expectation of the square is going to be the product of the expectations, and so it’s probably zero,” but actually it’s not probably zero because most things have some amount of variance. And then hopefully someone can come in or ideally we’ll have a heuristic estimator that does this by default, and then someone will come in with a list of ways in which this presumption of independence is not very good. For example, the square of things is often positive or things often have variance, and then we’ll be able to add this to our estimate and we’ll be able to do a more refined version of mean propagation where you keep track of various covariances between things and higher order analogs of. A,nd then hopefully we’ll be able to do that. And I think there’s a bunch of follow-up questions which is like, “Why is this useful? Why do you think this is possible?” Maybe I’m happy to go in the order that you think is best.

Daniel Filan: Sure. So first of all: my understanding is that this is related to the mechanistic anomaly detection paradigm where I think the thing is supposed to be that a heuristic argument is a mechanism, or is playing the role of a mechanism. Is that right?

Mark Xu: I think it’s more accurate to think of heuristic arguments as lists of mechanisms. I think that is also not quite right and the entire paradigm is a bit not… Or the words are not good because I think there’s going to be no such thing as one mechanism as distinct from other mechanisms. I just think it’s going to be an entire layered mess of different mechanisms constantly interacting with each other, et cetera, et cetera, where it’s hard to point to any computer program and be like, “There are seven distinct mechanisms for why the program is behaving as it is.” There’s just one thing and then it splits into two things and then interacts with the third thing, and so you probably won’t have this list of distinct mechanisms. But the idea is that a heuristic argument is a set of mechanisms, or a set of mechanisms plus ways that they relate to each other, et cetera, et cetera, and the hope is that this is the kind of object that we’ll need to be able to do mechanistic anomaly detection.

Daniel Filan: If the old arguments for why your network is outputting ‘be nice’ is different from the new argument for why your network is outputting ‘be nice’, then you should be as worried as if there were different mechanisms, something like that?

Mark Xu: Or the hope is to just use the heuristic estimator, plus heuristic argument in all of the schemes that I was describing previously. You have a heuristic argument for why your model is behaving the way it is, or your heuristic argument explains like 99.9% of the variance of your model during training. And then you ask whether or not there exists any heuristic argument that explains 99.9% of the variance of your model during training, but does not explain this particular data point off distribution, and then if that ever happens, then you’re like, “This data point was using a part of the model only responsible for one in 1,000 parts of its variance during training, that’s kind of weird. No good.”

Heuristic arguments in physics

Daniel Filan: This is actually interesting, because one thing that I was thinking of when thinking about this is: in physics you can often use these energy arguments. So, here’s a classic energy argument. You have some ramp, you have a ball, it’s resting on the top of the ramp, and for convenience’s sake, this ball, it’s actually just infinitely… It’s a point particle. It’s got zero radius. So, you put it at the top of the ramp and then you let go, and then the bottom of the ramp is horizontal, and at the end the ball is traveling at the bottom of the ramp. And you ask, how fast is the ball going at the bottom of the ramp? And there’s this way you could argue it, which is like, “Well, the height between the top of the ramp and the bottom of the ramp is this much, and so initially the ball had this gravitational potential energy that it used up by moving down, and at the end, the ball just has kinetic energy, which is just its velocity.” And then you’re like, “Well, this is going to be equal because energy is conserved,” and then you back out the velocity of the ball.

And the interesting thing about this argument is that it doesn’t actually rely on the shape of the ramp. The ball could do loop-de-loops in the middle and bounce around off a bunch of things and you’ll still get the same answer. And so, there are a bunch of mechanisms that correspond to this argument of energy conservation, but it seems like at some level, as long as you can have this argument that energy got conserved, that’s qualitatively different from arguments where I took the ball and then put it at the bottom of the ramp and then gave it the right amount of velocity. I don’t know, that’s not really a question. So, you have the choice to react to that or not.

Mark Xu: So, I think we often consider examples like this from physics, and the hope is that maybe more abstractly, there’s a set of approximations that people often make when solving simple physics problems, like there’s no air resistance, things are point masses, there’s no friction, et cetera, et cetera. And the hope is that these all correspond to uses of the presumption of independence. This is simplest for the ideal gas law, where the presumption of independence you’re making when you’re using the ideal gas law is the particles… their position is randomly distributed inside the volume. They don’t interact with each other. Particles are independent of each other, they have no electrostatic forces on each other, and they have no volume also. And it’s not immediately obvious to me what uses of the presumption of independence correspond to no friction.

Daniel Filan: I think it’s not. So, if I think about no friction or point particle in the ramp example, if you add in those assumptions, those systematically change the answer… I don’t know, I guess those can only ever make your predicted velocity go down. They can never increase your predicted velocity because energy gets leaked to friction or because part of your energy is making you rotate instead of making you have a linear velocity, and that strikes me as different from other cases where if things can be correlated, well, they can be correlated in ways that make the number bigger or in ways that make the number smaller usually. Except for the case of, you square something, and in that case when you add the covariance between X and X that always makes the expected square bigger.

Mark Xu: So, I think there are going to be cases in general where adding more stuff to your heuristic argument will predictably make the answer go one way or another, because you looked at the structure of the system in question before thinking about what heuristic arguments to add. I think there are ways of making… One presumption of independence is relative to the ball’s center of mass. The position of the various bits of the ball is independent or something.

Whereas if balls are actually rotating or the velocity of the bits of the ball is independent. Or zero maybe, but actually balls rotate sometimes, or… I actually don’t know about the friction one. I guess balls don’t have that much friction, but you can assume that the bits of the ball don’t exert any force on the bits of the ramp or whatever.

Daniel Filan: Or you could imagine maybe the ball rubbing against the ramp causes the ramp to push the ball even harder.

Mark Xu: Maybe, I think that’s hopefully not going to be a valid use of the presumption of independence, but I’m not quite sure.

Daniel Filan: Or I guess it could slow the ball down or it could speed the ball up, like -

Mark Xu: If you were truly naive looking at physics, you’d be like, “Who knows what’s going to happen when you add this interaction term?”

Daniel Filan: In fact, modeling friction is actually kind of tricky. You get told ‘ah, friction!’ but wait, what is friction?

Mark Xu: I think these are the kinds of things where heuristic estimators are hopefully quite useful. There’s some intuitive question or problem which is: there’s a ball at the top of the ramp, what is its velocity going to be at the bottom of the ramp? And oftentimes the actual answer is, I don’t know, there’s a trillion factors at play. However, you can make this energy argument that’s a pretty good guess in a wide range of circumstances, and you’re never going to be able to prove basically that the velocity of the ball is going to be a certain thing at the bottom. I guess proof is complicated when you’re talking about the physical world, but if you suppose that you have a computer simulation with all the little particles interacting with each other, basically the only way that you’re going to prove that the ball ends up at the bottom with a certain velocity is to exhaustively run the entire simulation forwards and check at the end what its velocity actually is.

However, there’s this kind of reasoning that seems intuitively very valid, which is: assume the air doesn’t matter, assume the friction doesn’t matter, just calculate the energy and do that argument, that we want to be able to capture and formalize in this concept of a heuristic estimator, and then the heuristic argument will correspond to ‘consider the energy’. Maybe if there’s lots of wind, then in fact the air resistance on the ball is non-negligible, and then you can add something to your heuristic argument, which is ‘also you’ve got to consider the fact that the wind pushes the ball a little bit’ and then your estimate will get more accurate over time, and hopefully it will be a big, good time. I don’t know, hopefully it’ll work.

Daniel Filan: And interestingly, ‘the wind pushes the ball’ is saying that the interactions between the wind and the ball are not zero mean, or there’s some covariance or something.

Mark Xu: One should notice these things to get a better estimate. And so, hopefully we’ll be able to do this for everything, hopefully we’ll be able to do it for all the physics examples, and there’s a set of math examples that I’m happy to talk about, and then obviously the application we truly care about is the neural net application.

Difficult domains for heuristic arguments

Daniel Filan: Speaking of examples: so in the paper you have a bunch of examples from number theory. There are also some other examples, but to be honest, it seems like the main victory of these sorts of arguments is in number theory, where you can just imagine that primes are randomly distributed and make a few [assumptions] about their distribution and just deduce a bunch of true facts. So, what I’m wondering is outside number theory… So, you give some examples where heuristic arguments do work: are there any cases where you’ve had the most difficulty in applying heuristic arguments?

Mark Xu: Yeah, so I haven’t personally spent that much time looking for these examples. So, we are quite interested in cases where heuristic arguments give “the wrong answer”. I think this is a bit complicated, and so the reason it’s a bit complicated is: there’s this nature of a heuristic argument to be open to revision; and so, obviously you can make heuristic arguments that give the wrong answer. Suppose A and B are correlated, and I just don’t tell you to notice the correlation, then you’re just going to be really wrong. And so, it’s kind of unclear what is meant by a heuristic argument that gives the wrong answer. And so we often search…

There’s two sorts of things that we’re interested in. One sort of thing is an example of a thing that’s true or an example of an improbable thing that happens way more likely than you would’ve naively expected, that seems to happen for “no reason”, such that we can’t find a heuristic argument in some sense. So, obviously you can find a heuristic argument if you just exhaustively compute everything, and so there’s some notion of length here where we can’t find a compact heuristic argument that seems to describe “why this thing happens the way it does”, where we conjecture that it’s always possible to explain a phenomenon in roughly the same length of the phenomenon itself.

And then another thing that we search for often is we want to be able to use heuristic arguments for this sort of mechanism distinction problem. And so, we search for examples of cases where heuristic arguments have the structure where it’s like: A happens, therefore B happens, and sometimes C happens, therefore B happens, but you can’t distinguish between A and C efficiently. So, I think the closest example we have of something that seems kind of surprising, but we don’t quite have a heuristic argument, is… So, you have Fermat’s Last Theorem, which states that a^N + b^N = c^N has no integer solutions for N greater than two.

Daniel Filan: Non-trivial integer solutions.

Mark Xu: Non trivial, yes.

Daniel Filan: Zero and one.

Mark Xu: Well, one doesn’t quite work because…

Daniel Filan: 0^N + 1^N = 1^N.

Mark Xu: Oh, sure. And so the N = 4 and above case, the heuristic argument is quite simple where you can just assume that the fourth powers are distributed randomly with their respective intervals or whatever; there’s like N fourth powers between 1 and N to the 4, and then you can just assume that these fourth powers are distributed independently and then you correctly deduce that there are going to be vanishingly small solutions to this equation. I’m not that familiar with the details here, but the N = 3 case is interesting, and in fact, the heuristic argument for the N = 3 case is basically the same as the proof for the N = 3 case, because the way the numerics work out for the N = 3 case is if you do this naive assumption, there’s going to be infinite expected solutions, and then there’s a complex structure in the equation itself that suggests that all the solutions are correlated with each other.

So, either there’s going to be a lot of solutions or there’s going to be basically none solutions, and then you check and in fact, there’s going to be basically none solutions or basically zero solutions, and the heuristic argument for this fact is basically the proof of this fact, which is quite long. But the problem is, you also heuristically expect the solutions to be very sparse, and so you’re not that surprised that you haven’t found a solution until you’ve checked a lot of examples or a lot of numbers, and so the number of things that you have to check is large, such that you can fit the heuristic arguments… So you’re not that surprised. I don’t know if that made that much sense.

Daniel Filan: That makes some sense. So I don’t know, it’s like this case where heuristic arguments… it’s like they’re not winning that much over proof.

Mark Xu: It almost didn’t fit, but the phenomenon you were supposed to be surprised by was barely not surprising enough such that you had enough room to fit the argument in before you got too surprised, such that you would expect there to be an argument or something. I don’t know, it’s kind of unclear whether or not that should be evidence in favor of our claim or against our claim that there’s always heuristic arguments, because in fact there is a heuristic argument in this case, but it maybe barely wasn’t a heuristic argument, and so it’s kind of confusing.

Why not maximum entropy?

Daniel Filan: One thing that I found myself wondering was… It seems like you have this structure where you have this list of facts, and you’re wanting to get a heuristic argument given this list of facts. And in the paper you describe this cumulative propagation thing, which basically just involves keeping track of expectations and covariances and some higher order things. You pick some level that you want to keep track of and you do that. But then you mention some failures of this approach and some ways you can fix it. One thing that struck me as the obvious first approach is just having a maximum entropy distribution given some facts.

So for people who don’t know, this is this idea of, you have some probability distribution over some set of interest and you don’t know what your probability distribution is, but you know that it’s got to have this average and this correlation between these two things and this covariance. And basically you say, “Well, given that I only know these things, I want to maximize the entropy of my probability distribution.” So, my probability distribution has to satisfy these constraints, but I’ve got to be as uncertain as I can possibly be given those constraints.

And so, this is a big thing in Bayesian reasoning, and intuitively it seems like it would fit, this thing. Every time you hear a new fact, you can add a constraint to your… And now get a new maximum entropy distribution. It always gives answers. I remember in discussion of cumulant propagation, you mentioned some cases where maximum entropy would do better than it does, although I don’t remember the exact reason why. I think it was something to do with calculating the square of some quantity is negative in expectation. So, why not just do maximum entropy?

Mark Xu: So, historically we’ve considered a lot of maximum entropy-type approaches and all of them were a bit unsatisfying. So, I think there’s three-ish reasons that we think maximum entropy is unsatisfying in various ways. I think the first one is perhaps most compelling, but most naive, and I’ll give that example first. So suppose that you have A and B, and then you’re taking the AND of A and B, so suppose we have C equals A and B.

Daniel Filan: Yep.

Mark Xu: And suppose that we know that C is 50/50, then the maximum entropy distribution on A and B given C is 50/50 both… Sorry, so if we condition on C being 50/50, the maximum entropy distribution over A and B has A and B correlated a little bit, but also it raises the probability of A and it raises the probability of B.

Daniel Filan: Relative to…

Mark Xu: Relative to 50/50.

Daniel Filan: Okay.

Mark Xu: And there’s various reasons why this is unsatisfying. The main reason, I guess, it’s unsatisfying is: suppose you have a circuit that is computing some stuff, and then you just add a bunch of auxiliary gates where you have some wires, and then you pick two wires and you’re like, “I want these wires to be correlated,” then you can just add the AND of those wires a bunch of times to your circuit, and then if you want the maximum entropy distribution over all the wires, it will want to push the AND of those two wires down to 50/50 because that’s the maximum entropy probability for a wire to be. And so, it’ll induce this fake correlation between the wires while you’re trying to do that.

Daniel Filan: Sorry, just going back to the A, B, C which is A and B thing: is the idea that the thing you wanted was for A and B to be uncorrelated, but have high enough probability that when you combine them to get C, the probability of C being true is 50/50?

Mark Xu: I think I actually explained the example wrong. So suppose you just have: your circuit is just two wires, it takes two inputs and it just outputs both of them, then the maximum entry distribution is 50/50 over both wires, that seems correct. And then suppose you just randomly added an AND gate, then the maximum entropy distribution would–

Daniel Filan: So, when you say randomly added an AND gate, you mean like you have your two inputs and then you put an AND gate that takes your two inputs, and the output of that gate is your new output?

Mark Xu: No, you don’t connect it to the output at all. You still output A and B, but you’ve just added an AND gate as some auxiliary computation that your circuit does for no reason.

Daniel Filan: Okay.

Mark Xu: Then, if you want to take the maximum entropy distribution over all the things your circuit is doing, then adding this gate will artificially raise the probability of A and B and also induce this correlation between A and B.

Daniel Filan: Why?

Mark Xu: So naively it won’t because your maximum entropy distribution will just be like A, B, C, all 50/50 and independent, and then suppose someone came to you and was like, “Actually, you need to respect the logical structure of this circuit. C can only be on if A and B are both on. The probability of C being on has to be the probability of A being on times the probability of B being on plus the interaction term,” then you’ll get this induced correlation between A and B plus A and B will be raised to 0.6 or whatever.

Daniel Filan: So, is the idea that you have this intermediate thing in your circuit and you have to be maximum entropy also over this intermediate thing, and so even though you didn’t really care about the intermediate thing, it didn’t play any role of interest in anything else, because you’re being maximum entropy over that, you’ve got to mess around with the distributions over other variables?

Mark Xu: Yeah, that’s right, and I think of this as a special case of a more general phenomenon where there’s this computation that you’re doing throughout the circuit, and you can pretend there’s some notion of time in your circuit, or gates that only take input wires as inputs are at time 1, and then if a gate takes as inputs gates that themselves take inputs as input, then that’s time 2, et cetera, et cetera, where you’re going down your circuit and computing things. And maximum entropy distributions have this property where things that you compute in “the future” go back in time and affect the past.

Daniel Filan: In the sense of once you know things about the future, that gives you constraints on the past?

Mark Xu: Yeah, and if you want to be maximum entropy, it causes you to mess with your probabilities of the input wires: if you learn that you’re computing lots of ANDs then you want to raise the probability of your input wires, et cetera, et cetera. And I think this is in general going to be a kind of an undesirable property, and it’s a bit complicated to talk about why it’s undesirable.

I guess the simplest reason is someone can mess with you by putting a bunch of stuff at the end of your circuit and then once you get to the end of your circuit, you’re forced to go back and revise your initial guesses. Whereas it seems like intuitively, heuristic arguments want to be more deductive in nature, where you start with some premises and then you deduce facts from those premises, and nothing anyone can say after you’ve started with your premises or nothing anyone can add to your circuit after you deduce facts from your premises can cause you to go back in time and be like, “Actually, my premises were wrong because…” I don’t know, it doesn’t really make that much sense that if you assume the A is 50/50 and B is 50/50 and then you’re like, “Ah, I’m computing C, which is A and B, I guess A and B are like 60/40 now and they’re also slightly correlated.”

That kind of reasoning seems intuitively invalid, and it’s the kind of reasoning that various forms of maximum entropy force you to take. And you can try to do various other stuff to eliminate this problem which we’ve explored historically, and I think none of them seem that satisfying and fix the problem that well or something.

Daniel Filan: Right, that’s a useful response. You just end up with these useless variables that you shouldn’t have cared about the entropy of, but you do now. So, this relates to this question I have about the adversarial robustness of these heuristic estimators.

Mark Xu: So, I said originally that there were three reasons why we didn’t maximum entropy. I’m happy to talk about the other two also, but maybe you don’t want to.

Daniel Filan: I think I’m sold by reason 1. If you want to talk about reason 2 and reason 3, I’m all ears.

Mark Xu: Well, so I think they are somewhat related to this more general problem of why heuristic arguments are going to be possible to begin with. So, I think reason 2 is going to be something like: we can’t actually maintain full probability distributions over things. So, there’s an important thing a probability distribution has to be, which is consistent with various facts or something. It has to be a real probability distribution. It has to describe actual things that are realizable. And I think heuristic arguments or your heuristic estimator can’t be real in that sense. Your distribution over possible states your circuit can take can’t be a distribution over actual possible states your circuit can take.

So, one reason that we want this is some of our mechanistic anomaly detection applications have us believing things like A is on and B is on, but A and B are not on, because we want to be able to drop out mechanisms and you just straight out can’t do that if you have an actual probability distribution over actual outcomes because A being on and B being on must imply that A and B is being on.

And the second reason is in terms of just computational complexity, it’s very likely just going to be impossible to actually have a distribution over realizable or physically possible or logically possible states, because the act of determining whether or not something is logically possible is very hard. I think Eric Neyman has a couple of neat examples where it’s just various circuits where it’s #P-hard to determine or to have a distribution over actual states that are satisfied by that circuit, although I’m not familiar with the details enough to talk about them here. And then there was some third reason which I’m forgetting.

Adversarial robustness for heuristic arguments

Daniel Filan: Cool, so one thing that I’m wondering about is: you mention in the paper that it’s difficult to get adversarial robustness. There are various reasons you think that you should be able to be fooled by people adversarially picking bad arguments. So often in the AI realm, when I hear that something is vulnerable to adversaries, I’m really worried because I’m like, “Well, what if my AI is an adversary?” I’m wondering, do you think that this worry applies in the heuristic arguments case?

Mark Xu: Yeah, so first all, I guess I’ll talk about situations in which I think it’s basically impossible to be adversarially robust. So, suppose you have a hash function that gives -1 or 1, and you’re estimating the sum of hash of N over N², and suppose your heuristic argument can just point out the values of various hash terms. So your naive heuristic estimator, presumption of independence, is 50/50 between one and negative one, so the estimate is zero; and then your heuristic argument consists of pointing out the values of various hash terms. In this situation, there’s just always going to exist arguments that continue to drive you up or continue to drive you down.

And someone’s always going to be like, “Hey, hash of 2 is 1, hash of 7 is 1, hash of 9 is 1” and you’re just going to be driven up and up. It’s possible it should be like hash of N over N^1.5 or something to make it bad. And so, you’re in this awkward spot where for any quantity with sufficient noise, even if you expect very strongly the noise to average out to zero, there will exist a heuristic argument that only points out the positive values of the noise and will drive you up, or there’s a heuristic argument that only points out the negative values of the noise and drive you down. And so, I think this is naively quite a big problem for the entire heuristic argument paradigm, if you’re ever relying on something like someone doing an unrestricted search for heuristic arguments of various forms and inputting them into your heuristic estimator.

So, there’s a couple ways out. First, I’ll talk about the simplest way out, that I think ends up ultimately not working, but is worth talking about anyway. I think the simplest thing you can do is just have two people searching for heuristic arguments, one person is driving them up and one person driving them down. So, hopefully if you have one search process searching for heuristic arguments to drive you up and one to drive you down, then we don’t systematically maybe expect one person to be biased instead.

I think this is a problem. For example, one obvious thing to do is both players have equal amounts of time to find heuristic arguments or something. However, if you consider the previous case with hash of N, suppose instead of being 50/50 1 and -1, hash of N is actually 1/4 probability of 1, 3/4 probability of -1, then in that case the player finding heuristic arguments to go down has a 3x advantage over the player finding heuristic arguments to go up. And so, then you’re no good because you can have these quantities where it’s far easier to find heuristic arguments to drive you up than to drive you down or vice versa.

Daniel Filan: Although in that case you really should be below zero in expectation, right?

Mark Xu: Yeah.

Daniel Filan: I think there’s some example in the paper (that people can read) where it really checks out that debate is not going to work. [Examples of this sort are in Appendix E]

Mark Xu: So, you have examples that that make you very sad, and so instead of doing debate, we want to do this other thing that’s going to be kind of interesting. So, if you imagine searching for heuristic arguments in the setting where you have hash of N over N^1.5 or whatever, the way that you’re going to do this is you’re going to first check hash of 1, then you’re going to check hash of 2, and then you’re going to check hash of 3, et cetera, et cetera. And suppose that you wanted to drive the estimate of the quantity up, then you would check hash of 1, and suppose it’s positive, you’re like, “Great, let’s include Term 1” and then you check hash of 2 and it’s negative and you’re like, “Let’s not include Term 2”, and then you check hash of 3 and it’s also negative and you’re like, “Let’s not include that”, check hash of 4, it is positive, so then you include that.

And you’re carefully going through these terms and eliminating the ones that are negative. However, the interesting thing is if you imagine being the person that’s doing this, there in some sense exists a broader heuristic argument that you first considered and then pruned down, which is: you know the values of all these four terms, and then you selected the terms that were only positive.

And so, there’s some sense in which, if instead of including only the heuristic argument that you produce at the end, you included the entire search process for the heuristic argument and also I guess the heuristic argument you included at the end, then hopefully the heuristic estimator is forced to be like, “Well, we actually looked at all four of these terms and then we discarded two of them, but the discarding part isn’t relevant because I already know all four of these logical facts, so I’m forced to include all four of these terms.” And so, then the hope is not that there doesn’t exist heuristic arguments that are misleading, but there’s no way of searching for heuristic arguments that’s systematically misleading if you don’t already know the values of the heuristic arguments. And this is related to the presumption of independence being “correct on average” or something. And so if you’re presuming independence in a way that’s correct on average, then on average people can only… When searching for arguments and just looking at stuff, they can only lead you closer to the truth hopefully.

Daniel Filan: Sure. So the idea is that if in some sense we can… Well, we should be able to have some broad search process over arguments that essentially isn’t adversarial and use that, sounds like is the upshot.

Mark Xu: Well, it’s: even if the search is adversarial, if you’re conditioning on everything that the search process knows while it’s searching, then it can’t be adversarial, because all it can do is look at values, and it doesn’t know what the value is before it looks, and if you’ve actually presumed independence correctly, then on average everything it looks at has an equal chance of driving you up or down.

Daniel Filan: I guess I have this concern that maybe the point of mechanistic anomaly detection was to help me know how to elicit this latent knowledge or something. And so, I might’ve been imagining using some AI to come up with arguments and tell me that the thing was bad. And if I’m using that AI to help me get mechanistic anomaly detection and I need mechanistic anomaly detection to help me get that AI be good, then that’s some sort of recursion. It might be the bad type of recursion. It might be the good type of recursion where you can bootstrap up, I guess.

Mark Xu: I think it’s not very clear to me how this is ultimately going to shake out. I think this kind of adversarial robustness, or how we deal with adversarial robustness, is important. I think there are other ways out besides the way I described. I think that’s intuitively what we want to do: we want to just condition your heuristic estimator on everything, even the search for the arguments and also, I don’t know, just all the things. And hopefully if you don’t let any adversarial selection into your heuristic estimator, then it can’t be adversarially selected against or something. So, there’s a thing of: if you’re a proper Bayesian and someone’s adversarially selecting evidence to show you, the thing that’s supposed to happen is eventually you’re just like, “And now I think I’m being adversarially selected evidence to be shown, and I know this fact and so I’m just being a Bayesian”, and then you just update on that fact, and you just aren’t wrong on average if you’re correct about the kinds of ways in which you’re being adversarially selected against.

Daniel Filan: Although, it does require you to… I don’t know, I feel like this is a bit of Bayesian cope because in order to do that… When you’re a Bayesian, you’re supposed to be like, “Oh, I’m just going to get observations and update on observations.” But now you’re like, “Well, now my new observation is that I’m observing that observation and maybe the observation process is weird,” but then I feel like, you’ve got this layer and after that layer you just trust it and before that layer you worry about things being screened or whatever. And I’m like, “Where do you put the layer?” You had to enlarge your model to make this be… I don’t know, it seems a bit-

Mark Xu: I think being a Bayesian is hard, but the thing we’re going to try to do is just look over the shoulder of the person selecting the evidence to show you and then see everything that they see, and then hopefully they’re not misled on average because they’re just looking at stuff.

Daniel Filan: In that case, it’s better than the Bayesian case or I don’t have the same complaints as I have about the Bayesian case.

Other approaches to defining mechanisms

Daniel Filan: So stepping back a bit, there are a few different ways that somebody might have tried to concretize a mechanism. So, there’s this one approach where we have something called a heuristic argument, and we’re going to work to try and figure out how to formalize that. I guess we haven’t explicitly said this yet maybe, but in the paper, it’s still an open problem, right?

Mark Xu: Yeah, or we have-

Daniel Filan: There are avenues.

Mark Xu: I guess I would say: we have some heuristic estimator for some quantities that are deficient in various ways, and we’re trying to figure out what’s up and get better ones for other quantities and then unify them and hopefully all the dominoes will fall over once we quest sufficiently deeply.

Daniel Filan: Nice, so there’s this approach. There’s also some people working on various causal abstractions. So causal scrubbing is work by Redwood Research, there’s other causal abstraction work by other groups. I’m wondering: out of the landscape of all the ways in which one might have concretized what a mechanism is, how promising do you think heuristic arguments is?

Mark Xu: So, I’m tempted to say something like: if the causal abstraction stuff works out, we’ll just call that a heuristic argument and go with that sort of thing. ‘Heuristic arguments’ is intended to be this umbrella term for this sort of machine-checkable, proof-like thing that can apply to arbitrary computational objects, and I think that’s what I’m tempted to say. I think in fact, we do have various desiderata that we think are important for our particular approach to heuristic arguments, and we want them to be proof-like in various ways that e.g. causal abstractions often aren’t, or we want them to be fully mechanistic and have zero empirical parts in them. Whereas often, various causal abstraction approaches allow you to measure some stuff empirically and then do deduction with the rest of it. I very much think of us as going for the throat on this sort of heuristic argument stuff.

There’s some question you might ask, which is like, “Why do you expect heuristic arguments to apply to neural nets?” And the answer to that is, well, neural nets are just a computational object and we want to have heuristic estimators that work for literally all computational objects, and then the reason why it applies to neural nets in particular is just because neural nets are a thing that you can formally define. Whereas there’s a lot of more restricted approaches to explanations that are like, “Well, we only actually care about the neural net thing, and so we ought to define it in particular about the neural net thing”. So maybe in summary, we’re trying to be in some sense maximally ambitious with respect to the existence of heuristic arguments.

The research plan: progress and next steps

Daniel Filan: I got you. So, I’d like to move on a little bit to the overall plan. So I think in some of these early posts, the idea is step 1: formalize heuristic arguments, step 2: solve mechanistic anomaly detection given the formalism of heuristic arguments, and step 3: find a way of finding heuristic arguments. And really, I guess these three things could be run in parallel or not necessarily in order. I’m wondering, the last batch of publications was in late 2022. How’s the plan going since then?

Mark Xu: So, I think Quarter 1 of 2023 was a bit rough and we didn’t get that much done. I think it’s sped up since then. So, I can talk about stuff that is… So, things that I think are good that have happened: we have a pretty good formal problem statement about what it means to be a heuristic estimator for a class of functions or a class of circuits that we call bilinear circuits, where just all your gates are squares. So, we have that desideratum and we can talk about what it means to be a good heuristic estimator with respect to that class of circuits in a way that things we think are good satisfy… Or some stuff we have satisfies and some other stuff we have doesn’t satisfy. So, hopefully that’s just a fully formal problem.

We have a better class of heuristic estimators for cumulant propagation - .we have an upgrade on cumulant propagation that we call reductant propagation, which is just slightly different: instead of keeping track of cumulants, you keep track of a different thing called a reductant, and it’s slightly better in various ill-defined ways, but it’s more accurate empirically on most distributions of circuits we’ve tried, et cetera, et cetera and we feel better about it. We have a clearer sense of what it means to do mechanistic anomaly detection, and how that needs to go, and what the hard cases are going to be for doing mechanistic anomaly detection in terms of being unable to distinguish between mechanisms. I noticed I didn’t really answer your original question, which is like, “How’s it going?” I think it’s going par or something. Maybe Q1 was slightly below par and we’re doing slightly above par in Q2, and so it’s on average par or something.

Daniel Filan: So, par in the sense of you’ve gotten a better sense of some special cases, not yet knocked totally out of the park, but-

Mark Xu: Or if you told me that this is how much progress we would’ve made at the beginning of 2023, I would be like, “Okay, that seems good enough to be worth doing: let’s do that, this is how we should spend our time” or something.

Daniel Filan: Cool. I’m wondering: given that progress - so out of the steps of formalize heuristic arguments, solve mechanistic anomaly detection given that, and find heuristic arguments, how optimistic are you about the steps? Which ones seem most difficult?

Mark Xu: So, I think the thing that’s most difficult is formalizing heuristic arguments in a way that makes them findable. Here’s a formalization of heuristic arguments: they’re just proofs. And so, there’s some sense in which heuristic arguments being compact is intrinsic to the nature of what it is to be a heuristic argument. I think doing useful stuff with heuristic arguments is pretty unlikely to be the place where we fall down. I think it’s possible that heuristic arguments get formalized and then we’re like, “Darn, we can’t do any of the things that we thought we were going to be able to do with heuristic arguments, because it turns out that they’re very different than what we thought they were going to be like.” I think that would be pretty good though, because we would have heuristic arguments and we would in some sense be done with that part. It would be very surprising to me if they were not useful in various ways. I don’t know if that was a complete answer.

Daniel Filan: I think that worked. I’m wondering if there’s been any experimental work on trying out mechanistic anomaly detection things beyond the… So, I guess you’ve mentioned there are various interpretability things that you think won’t scale. Is there any promising experimental work you’re aware of?

Mark Xu: So, I am not very aware of experimental work in general, but I think Redwood is currently working on what they’re calling ELK benchmarks where they’re trying to do this sort of mechanism distinction on toy problems like function evaluation. I don’t know how that’s going because I am not up-to-date on details.

Daniel Filan: Fair enough.

Mark Xu: I think ARC employees often write code to check whether or not heuristic estimators do some things, or check how empirically accurate they are, or find counter examples by random search or something. Probably you don’t want to call that experimental work, because we’re just checking how accurate our heuristic estimators for permanents of matrices are, or whatever. So, I think the short answer is I think Redwood’s doing some stuff that I don’t know that much about, and I’m not really aware of other stuff being done, but that is probably mostly because I’m not that aware of other stuff and not because it’s not being done, although it probably also isn’t being done in the way that I would want it to be done.

Daniel Filan: I got you, so it’s about time for us to be wrapping up. Before I close up, I’m wondering if there’s any question that you think I should have asked.

Mark Xu: I think people often ask for probabilities that these sorts of things all work out, to which I often say: 1/7 that everything works out roughly the way that we think it’s going to work out and is super great within five years-ish - maybe not quite five years now because I’ve been saying 1/7 over five years for more than a few months. So, maybe it’s like four and a half years now. But other than that, I don’t think so.

Following ARC’s research

Daniel Filan: Fair enough. So finally, if people are interested in following your research or if we have bright minds who are perhaps interested in contributing, what should they do?

Mark Xu: So ARC posts blog posts and various announcements on our website, alignment.org. And we’re also currently hiring, so you can go to alignment.org and click the hiring button and then be directed to our hiring page.

Daniel Filan: Great. Well, thanks for talking to me today.

Mark Xu: You’re welcome. Thanks for having me on.

Daniel Filan: This episode is edited by Jack Garrett, and Amber Dawn Ace helped with the transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Ben Weinstein-Raun, Tor Barstad, and Alexey Malafeev. To read a transcript of this episode or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at [email protected].

Discuss