
Dequantifying first-order theories

Published on April 23, 2024 7:04 PM GMT

The Löwenheim–Skolem theorem implies, among other things, that any first-order theory whose symbols are countable, and which has an infinite model, has a countably infinite model. This means that, in attempting to refer to uncountably infinite structures (such as in set theory), one "may as well" be referring to an only countably infinite structure, as far as proofs are concerned.

The main limitation I see with this theorem is that it preserves arbitrarily deep quantifier nesting. In Peano arithmetic, it is possible to form statements that correspond (under the standard interpretation) to arbitrary statements in the arithmetic hierarchy (by which I mean, the union of $\Sigma^0_n$ and $\Pi^0_n$ for arbitrary $n$). Not all of these statements are computable. In general, the question of whether a given statement is provable is a $\Sigma^0_1$ statement. So, even with a countable model, one can still believe oneself to be "referring" to high levels of the arithmetic hierarchy, despite the computational implausibility of this.
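For reference, the standard quantifier forms, with $\varphi$ decidable (these are the standard definitions, included here for concreteness):

```latex
\Sigma^0_n:\quad \exists k_1\, \forall k_2\, \exists k_3 \cdots k_n.\ \varphi(k_1, \ldots, k_n)
\qquad
\Pi^0_n:\quad \forall k_1\, \exists k_2\, \forall k_3 \cdots k_n.\ \varphi(k_1, \ldots, k_n)
```

Provability in a recursively enumerable theory is $\Sigma^0_1$ because it has the form "there exists a proof", and checking a candidate proof is decidable.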

What I aim to show is that these statements that appear to refer to high levels of the arithmetic hierarchy are, in terms of provability, equivalent to different statements that only refer to a bounded level of hypercomputation. I call this "dequantification", as it translates statements that may have deeply nested quantifiers to ones with bounded or no quantifiers.

I first attempted translating statements in a consistent first-order theory T to statements in a different consistent first-order theory U, such that the translated statements have only bounded quantifier depth, as do the axioms of U. This succeeded, but then I realized that I didn't even need U to be first-order; U could instead be a propositional theory (with a recursively enumerable axiom schema).

Propositional theories and provability-preserving translations

Here I will, for specificity, define propositional theories. A propositional theory is specified by a countable set of proposition symbols, and a countable set of axioms, each of which is a statement in the theory. Statements in the theory consist of proposition symbols, $\top$, $\bot$, and statements formed from and/or/not and other statements. Proving a statement in a propositional theory consists of an ordinary propositional calculus proof that it follows from some finite subset of the axioms (I assume that base propositional calculus is specified by inference rules, containing no axioms).

A propositional theory is recursively enumerable if there exists a Turing machine that eventually prints all its axioms; assume that the (countable) proposition symbols are specified by their natural indices in some standard ordering. If the theory is recursively enumerable, then proofs (that specify the indices of axioms they use in the recursive enumeration) can be checked for validity by a Turing machine.

Due to the soundness and completeness of propositional calculus, a statement in a propositional theory is provable if and only if it is true in all models of the theory. Here, a model consists of an assignment of Boolean truth values to proposition symbols such that all axioms are true. (Meanwhile, Gödel's completeness theorem shows something similar for first-order logic: a statement is provable in a first-order theory if and only if it is true in all models. Inter-conversion between models as "assignments of truth values to sentences" and models as "interpretations for predicates, functions, and so on" is fairly standard in model theory.)
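To make these definitions concrete, here is a minimal sketch in Python (the tuple encoding and function names are illustrative choices of mine, not canonical): statements as nested tuples, a model as an assignment of booleans to symbol indices, and a brute-force entailment check, which by soundness and completeness agrees with provability when the axiom subset and symbol set under consideration are finite.

```python
from itertools import product

# Statements: ("sym", i), ("top",), ("bot",), ("not", s), ("and", *ss), ("or", *ss).

def evaluate(stmt, model):
    """Truth value of a statement under a model (dict: symbol index -> bool)."""
    tag = stmt[0]
    if tag == "sym": return model[stmt[1]]
    if tag == "top": return True
    if tag == "bot": return False
    if tag == "not": return not evaluate(stmt[1], model)
    if tag == "and": return all(evaluate(s, model) for s in stmt[1:])
    if tag == "or":  return any(evaluate(s, model) for s in stmt[1:])
    raise ValueError(f"unknown connective: {tag}")

def entails(axioms, stmt, symbols):
    """Check whether every model of the (finite) axioms makes stmt true."""
    for bits in product([False, True], repeat=len(symbols)):
        model = dict(zip(symbols, bits))
        if all(evaluate(a, model) for a in axioms) and not evaluate(stmt, model):
            return False  # counter-model found
    return True
```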

Let's start with a consistent first-order theory T, which may, like propositional theories, have a countable set of symbols and axioms. Also assume this theory is recursively enumerable, that is, there is a Turing machine printing its axioms.

The initial challenge is to find a recursively enumerable propositional theory U and a computable translation of T-statements to U-statements, such that a T-statement is provable if and only if its translation is provable.

This turns out to be trivial. We define U to have one propositional symbol per statement of T, and recursively enumerate U's axioms by attempting to prove every T-statement in parallel, and adding its corresponding propositional symbol as an axiom of U whenever such a proof is found. Now, if a T-statement is provable, its corresponding U-statement is as well, and if it is not provable, its U-statement is not (as no axioms of U will imply anything about this U-statement).
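As a sketch of this enumeration in Python: check_proof(T, s, p), a decidable checker of whether p encodes a valid T-proof of s, is a hypothetical black box, but the dovetailing structure is the point. The index of a T-statement in the enumeration doubles as its U-proposition symbol.

```python
def enumerate_U_axioms(T, T_statements):
    """Yield (eventually) the U-symbol, i.e. index, of each provable T-statement.
    T_statements is an infinite iterator enumerating all T-statements."""
    seen, found, steps = [], set(), 0
    while True:
        seen.append(next(T_statements))
        steps += 1
        for i, s in enumerate(seen):
            if i in found:
                continue
            for p in range(steps):        # dovetail over candidate proofs
                if check_proof(T, s, p):  # hypothetical decidable proof checker
                    found.add(i)
                    yield i               # U gains the axiom: symbol for statement i
                    break
```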

This construction is somewhat unsatisfying. In particular, propositional compositions of T-statements do not necessarily have equivalent provability to corresponding propositional compositions of the translations of these T-statements. For example, if $\varphi$ translates to $\varphi'$ and $\psi$ translates to $\psi'$, we would like $\varphi \lor \psi$ to be provable in T if and only if $\varphi' \lor \psi'$ is provable in U, but this is not necessarily the case with the specified U (in particular, $\varphi' \lor \psi'$ is only provable in U when at least one of $\varphi$ or $\psi$ is provable in T, but $\varphi \lor \psi$ can be provable in T without either $\varphi$ or $\psi$ being provable).

We could attempt to solve this problem by introducing propositional variables corresponding to quantified statements, and an axiom schema to specify implications between these and other statements according to the inference rules of first-order logic. But first-order logic requires supporting unbound variables (e.g. from $P(x)$ for unbound $x$, infer $\forall x.\, P(x)$), and this introduces unnecessary complexities. So I will give a different solution.

Recap of consistent guessing oracles

In a previous post, I introduced an uncomputable problem: given a Turing machine that returns a Boolean whenever it halts, give a guess for this Boolean that matches its answer if it halts, and can be anything if it doesn't halt. I called oracles solving this problem "arbitration oracles". Scott Aaronson has previously named this problem the "consistent guessing problem", and I will use this terminology due to temporal priority.

In my post, I noted that an oracle that solves the consistent guessing problem can be used to form a model of any consistent first-order theory. Here, "model" means an assignment of truth values to all statements of the theory, which are compatible with each other and the axioms. The way this works is that we number all statements of the theory in order. We start with the first, and ask the consistent guessing oracle about a Turing machine that searches for proofs and disproofs of this first statement in the theory, returning "true" if it finds a proof first, "false" if it finds a disproof first. We use its answer to assign a truth value to this first statement. For subsequent statements, we search for proofs/disproofs of the statement given the previous commitments to truth values already made. This is essentially the same idea as in the Demski prior, though using a consistent guessing oracle rather than a halting oracle (which I theorize to be more powerful than a consistent guessing oracle).
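A sketch of this construction, treating the oracle and the proof search as black boxes: oracle takes a (description of a) machine that may or may not halt with a boolean and returns a boolean consistent with it, and proves(assumptions, s) is a hypothetical partial search that halts with True/False if s is provable/disprovable from the theory plus the accumulated assumptions, and diverges otherwise.

```python
def build_model(oracle, statements, proves):
    """Assign truth values to statements[0], statements[1], ... in order,
    consulting a consistent guessing oracle at each step."""
    commitments = []  # (statement, bool) pairs committed so far
    for s in statements:
        # The "machine" handed to the oracle: search for a proof or disproof
        # of s given the previous commitments; it may never halt.
        machine = lambda s=s, c=tuple(commitments): proves(list(c), s)
        guess = oracle(machine)
        commitments.append((s, guess))
        yield s, guess
```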

Applying consistent guessing oracles to dequantification

To apply this idea to our problem, start with some recursive enumeration of T's statements $\varphi_0, \varphi_1, \varphi_2, \ldots$. Let $M(i, j)$ refer to a Turing machine that searches for proofs and disproofs of $\varphi_j$ in the theory $T + \varphi_i$ (that is, T with the additional axiom that $\varphi_i$), returning "true" if it finds a proof first, "false" if it finds a disproof first. Note that, if $T + \varphi_i$ is consistent, one cannot prove both $\varphi_j$ and $\neg\varphi_j$ from $T + \varphi_i$.
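In sketch form (proofs(theory), a fair enumeration of all valid proofs in a theory, and the constructors phi and Not, are hypothetical):

```python
def M(i, j, T, phi):
    """Search for a proof or disproof of phi(j) in T + phi(i).
    Halts with True/False when one is found; diverges if phi(j) is
    independent of T + phi(i), which is exactly the case where a
    consistent guessing oracle may answer either way."""
    theory = T + [phi(i)]        # T (as a list of axioms) plus the axiom phi_i
    for p in proofs(theory):     # hypothetical fair proof enumeration
        if p.conclusion == phi(j):
            return True          # proof found first
        if p.conclusion == Not(phi(j)):
            return False         # disproof found first
```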

We will now define the propositional theory U. The theory's propositional variables consist of $Q(i, j)$ for all natural numbers $i$ and $j$; the statement $Q(i, j)$ is supposed to represent a consistent guessing oracle's answer to $M(i, j)$.

U's axioms constrain these $Q(i, j)$ to be consistent guesses. We recursively enumerate U's axioms by running all $M(i, j)$ in parallel; if any ever returns true, we add the corresponding $Q(i, j)$ as an axiom, and if any ever returns false, we add the corresponding $\neg Q(i, j)$ as an axiom. This recursively enumerable axiom schema specifies exactly the condition that each $Q(i, j)$ is a consistent guess for $M(i, j)$. And U is consistent, because its propositional variables can be set according to some consistent guessing oracle, of which at least one exists.
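The axiom enumeration can be sketched the same way; step_runner is a hypothetical helper that advances a computation one step per call, returning None until it halts with a value.

```python
def enumerate_Q_axioms(T, phi):
    """Dovetail all machines M(i, j); when one halts, emit the corresponding
    literal over the propositional variable Q(i, j) as a U-axiom."""
    runners, done, n = {}, set(), 0
    while True:
        n += 1  # each round: admit more (i, j) pairs, run each one more step
        for i in range(n):
            for j in range(n):
                if (i, j) in done:
                    continue
                run = runners.setdefault((i, j), step_runner(M, i, j, T, phi))
                result = run()            # one step; None if still running
                if result is not None:    # M(i, j) halted with a boolean
                    done.add((i, j))
                    yield ("Q", i, j) if result else ("not", ("Q", i, j))
```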

Now, as explained before, we can use $Q(i, j)$ to derive a model of T. We will do this by defining U-propositions $Q'(i)$ for each natural $i$, each of which is supposed to represent the truth value of $\varphi_i$ in the model:

$$Q'(j) := \bigvee_{v \in 2^j} \left( \bigwedge_{i < j} \big(Q'(i) \Leftrightarrow v(i)\big) \wedge Q\Big(\#\big(\text{“}\textstyle\bigwedge_{i<j} \varphi_i^{v(i)}\text{”}\big),\, j\Big) \right)$$

Notationally, $2$ refers to the set {0, 1}, $\varphi^1$ and $\varphi^0$ abbreviate $\varphi$ and $\neg\varphi$, $\#(\text{“}P\text{”})$ refers to the numbering of $P$ in the ordering of all T-statements, and $\bigvee$ and $\bigwedge$ refer to finite disjunctions and conjunctions respectively. My notation here with the quotations is not completely rigorous; what is important is that there is a computable way to construct a U-statement $Q'(j)$ for any $j$, by expanding everything out. Although the expanded propositions are gigantic, this is not a problem for computability. (Note that, while the resulting expanded propositions contain $Q(i, j)$ for constants $i$ and $j$, this does not go beyond the notation of propositional theories, because $Q(i, j)$ refers to a specific propositional variable if $i$ and $j$ are known.)

Semantically, what $Q'(j)$ says is that, if we add assumptions that the truth value of $\varphi_i$ matches $Q'(i)$ for $i < j$, then the consistent guessing oracle says that a machine searching for proofs and disproofs of $\varphi_j$ in T given these assumptions guesses that a proof is found before a disproof (noting, if there are neither proofs nor disproofs, the consistent guessing oracle can return either answer). $Q'$ specifies the iterative logic of making decisions about each $\varphi_i$ in order, assuring consistency at each step, assuming T was consistent to start with.
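To make "expanding everything out" concrete, here is a sketch of the expansion in Python, reusing the tuple encoding from earlier; index_of(bits) is a hypothetical function returning the number, in the enumeration of T-statements, of the conjunction asserting $\varphi_i$ or $\neg\varphi_i$ according to bits[i] for each i < j.

```python
from itertools import product

def Qprime(j, index_of):
    """Expand the U-statement Q'(j): some assignment v to the first j
    statements matches Q'(0), ..., Q'(j-1), and the oracle, given the
    corresponding conjunction as an extra axiom, guesses phi_j provable."""
    disjuncts = []
    for bits in product([False, True], repeat=j):  # all v in 2^j
        parts = [Qprime(i, index_of) if bits[i] else ("not", Qprime(i, index_of))
                 for i in range(j)]                # Q'(i) matches v(i) for i < j
        parts.append(("Q", index_of(bits), j))     # consistent guess under v
        disjuncts.append(("and", *parts))
    return ("or", *disjuncts)
```

The recursion recomputes smaller Q' values over and over, and the result is roughly doubly exponential in j; as noted above, size is not an obstacle to computability.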

We will translate a T-statement $\varphi_j$ to the corresponding U-statement $Q'(j)$. What we wish to show is that this translation preserves provability of propositional combinations of T-statements. To be more precise, we assume some $m$ and a function $h$ that forms a new statement from a list of $m$ propositions, using only propositional connectives (and, or, not). What we want to show is that $h(\varphi_{j_1}, \ldots, \varphi_{j_m})$ is provable in T if and only if $h(Q'(j_1), \ldots, Q'(j_m))$ is provable in U.

Let us consider the first direction. Assume $h(\varphi_{j_1}, \ldots, \varphi_{j_m})$ is provable in T. By Gödel's completeness theorem, it is true in all models of T. In any model of U, $Q'$ must represent a model of T, because $Q'$ iteratively constructs a model of T using a consistent guessing oracle. Therefore, $h(Q'(j_1), \ldots, Q'(j_m))$ is true in all models of U. Accordingly, due to completeness of propositional calculus, this statement is provable in U.

Let us consider the other direction. Assume $h(\varphi_{j_1}, \ldots, \varphi_{j_m})$ is not provable in T. By Gödel's completeness theorem, it is not true in all models of T. So there is some particular model of T in which this statement is false.

This model assigns truth values to all the statements $\varphi_i$. We add a finite number of axioms to U, stating that $Q'(i)$ matches the model's truth value for $\varphi_i$, for $i \leq \max(j_1, \ldots, j_m)$. To show that U with the addition of these axioms is consistent, we consider that it is possible to set $Q'(0)$ to the model's truth value for $\varphi_0$, and for each $n$, set $Q(f(n), n+1)$ to the model's truth value for $\varphi_{n+1}$, where $f(n)$ is the number of the conjunction specifying the model's truth values for $\varphi_0, \ldots, \varphi_n$. These settings assure that $Q'$ matches the model of T, by setting Q values according to this model. We also know that $M(f(n), n+1)$ cannot return true if $\varphi_{n+1}$ is false in the model, and cannot return false if $\varphi_{n+1}$ is true in the model; this is because Gödel's completeness theorem implies no T-statement consistent with the model can be disproven.

This shows that U with these additional axioms is consistent. Therefore, a model of U plus these additional axioms exists. This model is also a model of U, and in this model, $h(Q'(j_1), \ldots, Q'(j_m))$ is false, because $Q'$ agrees with the model of T in which $h(\varphi_{j_1}, \ldots, \varphi_{j_m})$ is false. By soundness of propositional logic, there is no proof of this statement in U.

So we have shown both directions, implying that $h(\varphi_{j_1}, \ldots, \varphi_{j_m})$ is provable in T if and only if $h(Q'(j_1), \ldots, Q'(j_m))$ is provable in U. What this means is that translating a propositional composition of T-statements to the same propositional composition of translated U-statements results in equivalent provability.

Conclusion

The upshot of this is that statements of a consistent first-order theory T can be translated to a propositional theory U (with a recursively enumerable axiom schema), in a way that preserves provability of propositional compositions. Philosophically, what I take from this is that, even if statements in a first-order theory such as Peano arithmetic appear to refer to high levels of the arithmetic hierarchy, as far as proof theory is concerned, they may as well be referring to a fixed low level of hypercomputation, namely a consistent guessing oracle. While one can interpret Peano arithmetic statements as about high levels of the arithmetic hierarchy, this is to some extent a projection; Peano arithmetic fails to capture the intuitive notion of the standard naturals, as non-standard models exist.

One oddity is that consistent guessing oracles are underspecified: they may return either answer for a Turing machine that fails to halt. This is in correspondence with the way that sufficiently powerful first-order systems are incomplete (Gödel's first incompleteness theorem). Since some statements in Peano arithmetic are neither provable nor disprovable, they must be represented by some propositional statement that is neither provable nor disprovable, and so the uncertainty about Peano arithmetic statements translates to uncertainty about the consistent guessing oracle in U.

In Peano arithmetic, one can look at an undecidable statement, and think it still has a definite truth value, as one interprets the Peano statement as referring to the standard naturals. But as far as proof theory is concerned, the statement doesn't have a definite truth value. And this becomes more clear when discussing consistent guessing oracles, which one can less easily project definiteness onto compared with Peano arithmetic statements, despite them being equally underspecified by their respective theories.




A case for AI alignment being difficult

Published on December 31, 2023 7:55 PM GMT

This is an attempt to distill a model of AGI alignment that I have gained primarily from thinkers such as Eliezer Yudkowsky (and to a lesser extent Paul Christiano), but explained in my own terms rather than attempting to hew close to these thinkers. I think I would be pretty good at passing an ideological Turing test for Eliezer Yudkowsky on AGI alignment difficulty (but not AGI timelines), though what I'm doing in this post is not that; it's more like finding a branch in the possibility space as I see it that is close enough to Yudkowsky's model that it's possible to talk in the same language.

Even if the problem turns out to not be very difficult, it's helpful to have a model of why one might think it is difficult, so as to identify weaknesses in the case and to find AI designs that avoid the main difficulties. Progress on problems can be made by a combination of finding possible paths and finding impossibility results or difficulty arguments.

Most of what I say should not be taken as a statement on AGI timelines. Some problems that make alignment difficult, such as ontology identification, also make creating capable AGI difficult to some extent.

Defining human values

If we don't have a preliminary definition of human values, it's incoherent to talk about alignment. If humans "don't really have values" then we don't really value alignment, so we can't be seriously trying to align AI with human values. There would have to be some conceptual refactor of what problem even makes sense to formulate and try to solve. To the extent that human values don't care about the long term, it's just not important (according to the values of current humans) how the long-term future goes, so the most relevant human values are the longer-term ones.

There are idealized forms of expected utility maximization by brute-force search. There are approximations of utility maximization such as reinforcement learning through Bellman equations, MCMC search, and so on.

I'm just going to make the assumption that the human brain can be well-modeled as containing one or more approximate expected utility maximizers. It's useful to focus on specific branches of possibility space to flesh out the model, even if the assumption is in some ways problematic. Psychology and neuroscience will, of course, eventually provide more details about what maximizer-like structures in the human brain are actually doing.

Given this assumption, the human utility function(s) either do or don't significantly depend on human evolutionary history. I'm just going to assume they do for now. I realize there is some disagreement about how important evopsych is for describing human values versus the attractors of universal learning machines, but I'm going to go with the evopsych branch for now.

Given that human brains are well-modeled as containing one or more utility functions, either they're well-modeled as containing one (perhaps some sort of monotonic function of multiple other score functions), or it's better to model them as multiple. See shard theory. The difference doesn't matter for now, I'll keep both possibilities open.

Eliezer proposes "boredom" as an example of a human value (which could either be its own shard or a term in the utility function). I don't think this is a good example. It's fairly high level and is instrumental to other values. I think "pain avoidance" is a better example due to the possibility of pain asymbolia. Probably, there is some redundancy in the different values (as there is redundancy in trained neural networks, so they still perform well when some neurons are lesioned), which is part of why I don't agree with the fragility of value thesis as stated by Yudkowsky.

Regardless, we now have a preliminary definition of human values. Note that some human values are well-modeled as indexical, meaning they value things relative to a human perspective as a reference point, e.g. a drive to eat food in a typical human is about that human's own stomach. This implies some "selfish" value divergences between different humans, as we observe.

Normative criteria for AI

Given a definition of human values, the alignment of a possible utility function with human values could be defined as the desirability of the best possible world according to that utility function, with desirability evaluated with respect to human values.
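In symbols, with $W$ the set of possible worlds, $V_H$ the human value function, and $\operatorname{align}$ my own informal rendering of this definition:

```latex
\operatorname{align}(U) \;:=\; V_H\!\left(\operatorname*{arg\,max}_{w \in W} U(w)\right)
```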

Alignment is a possible normative criterion for AI value systems. There are other possible normative criteria derived from moral philosophy. My "Moral Reality Check" short story imagines possible divergences between alignment and philosophical normativity. I'm not going to focus on this for now, I'm going to assume that alignment is the relevant normative criterion. See the Metaethics Sequence; I haven't written up something better explaining the case for this. There is some degree to which similar technologies to alignment might be necessary for producing abstractly normative outcomes (for example, default unaligned AGI would likely follow normative deontology less than an AGI aligned to deontological normativity would), but keeping this thread in mind would complicate the argument.

Agentic, relatively unconstrained humans would tend to care about particular things, and "human values" is a pointer at what they would care about, so it follows, basically tautologically, that they would prefer AI to be aligned to human values. The non-tautological bit is that there is some dependence of human values on human evolutionary history, so that a default unaligned AGI would not converge to the same values; this was discussed as an assumption in the previous section.

Given alignment as a normative criterion, one can evaluate the alignment of (a) other intelligent animal species including aliens, (b) default AI value systems. Given the assumption that human values depend significantly on human evolutionary history, both are less aligned than humans, but (a) is more aligned. I'm not going to assess the relative utility differences of these (and also relative to a "all life on Earth wiped out, no technological transcendence" scenario). Those relative utility differences might be more relevant if it is concluded that alignment with human values is too hard for that to be a decision-relevant scenario. But I haven't made that case yet.

Consequentialism is instrumentally useful for problem-solving

AI systems can be evaluated on how well they solve different problems. I assert that, on problems with short time horizons, short-term consequentialism is instrumentally useful, and on problems with long time horizons, long-term consequentialism is instrumentally useful.

This is not to say that no problems can be solved well without consequentialism. For example, multiplying large numbers requires no consequentialism. But for complex problems, consequentialism is likely to be helpful at some agent capability level. Current ML systems, like LLMs, probably possess primitive agency at best, but at some point, better AI performance will come from agentic systems.

This is in part because some problem solutions are evaluated in terms of consequences. For example, a solution to the problem of fixing a sink is naturally evaluated in terms of the consequence of whether the sink is fixed. A system effectively pursuing a real world goal is, therefore, more likely to be evaluated as having effectively solved the problem, at least past some capability level.

This is also in part because consequentialism can apply to cognition. Formally proving Fermat's last theorem is not evaluated in terms of real-world consequences so much as the criteria of the formal proof system. But human mathematicians proving this think about both (a) cognitive consequences of thinking certain thoughts, (b) material consequences of actions such as writing things down or talking with other mathematicians on the ability to produce a mathematical proof.

Whether or not an AI system does (b), at some level of problem complexity and AI capability, it will perform better by doing (a). To prove mathematical theorems, it would need to plan out what thoughts are likely to be more fruitful than others.

Simple but capable AI methods for solving hard abstract problems are likely to model the real world

While I'm fairly confident in the previous section, I'm less confident of this one, and I think it depends on the problem details. In speculating about possible misalignments, I am not making confident statements, but rather saying there is a high degree of uncertainty, and that most paths towards solving alignment involve reasoning better about this uncertainty.

To solve a specific problem, some methods specific to that problem are helpful. General methods are also likely to be helpful, e.g. explore/exploit heuristics. General methods are especially helpful if the AI is solving problems across a varied domain or multiple domains, as with LLMs.

If the AI applies general methods to a problem, it will be running a general cognition engine on the specific case of this problem. Depending on the relevant simplicity prior or regularization, the easily-findable cases of this may not automatically solve the "alignment problem" of having the general cognition engine specifically try to solve the specific task and not a more wide-scoped task.

One could try to solve problems by breeding animals to solve them. These animals would use some general cognition to do so, and that general cognition would naturally "want" things other than solving the specific problems. This is not a great analogy for most AI systems, though, which in ML are more directly selected on problem performance rather than evolutionary fitness.

Depending on the data the AI system has access to (indirectly through training, directly through deployment), it is likely that, unless specific measures are taken to prevent this, the AI would infer something about the source of this data in the real world. Humans are likely to train and test the AI on specific distributions of problems, and using Bayesian methods (e.g. Solomonoff induction like approaches) on these problems would lead to inferring some sort of material world. The ability of the AI to infer the material world behind the problems depends on its capability level and quality of data.

Understanding the problem distribution through Bayesian methods is likely to be helpful for getting good performance on that problem distribution. This is partially because the Bayesian distribution of the "correct answer" given the "question" may depend on the details of the distribution (e.g. a human description of an image, given an image as the problem), although this can be avoided in certain well-specified problems such as mathematical proof. More fundamentally, the AI's cognition is limited (by factors such as "model parameters"), and that cognition must be efficiently allocated to solving problems in the distribution. Note, this problem might not show up in cases where there is a simple general solution, such as in arithmetic, but is more likely for complex, hard-to-exactly-solve problems.

Natural, consequentialist problem-solving methods that understand the real world may care about it

Again, this section is somewhat speculative. If the AI is modeling the real world, then it might in some ways care about it, producing relevant misalignment with human values by default. Animals bred to solve problems would clearly do this. AIs that learned general-purpose moral principles that are helpful for problem-solving across domains (as in "Moral Reality Check") may apply those moral principles to the real world. General methods such as explore/exploit may attempt to explore/exploit the real world if only somewhat well-calibrated/aligned to the specific problem distribution (heuristics can be effective by being simple).

It may be that fairly natural methods for regularizing an AI mathematician, at some capability level, produce an agent (since agents are helpful for solving math problems) that pursues some abstract target such as "empowerment" or aesthetics generalized from math, and pursuit of these abstract targets implies some pursuit of some goal with respect to the real world that it has learned. Note that this is probably less effective for solving the problems according to the problem distribution than similar agents that only care about solving that problem, but they may be simpler and easier to find in some ways, such that they're likely to be found (conditioned on highly capable problem-solving ability) if no countermeasures are taken.

Sometimes, real-world performance is what is desired

I've discussed problems with AIs solving abstract problems, where real-world consequentialism might show up. But this is even more obvious when considering real-world problems such as washing dishes. Solving sufficiently hard real-world problems efficiently would imply real-world consequentialism at the time scale of that problem.

If the AI system were sufficiently capable at solving a real-world problem, by default "sorcerer's apprentice" type issues would show up, where solving the problem sufficiently well would imply large harms according to the human value function, e.g. a paperclip factory could approximately maximize paperclips on some time scale and that would imply human habitat destruction.

These problems show up much more on long time scales than short ones, to be clear. However, some desirable real-world goals are long-term, e.g. space exploration. There may be a degree to which short-term agents "naturally" have long-term goals if naively regularized, but this is more speculative.

One relevant AI capabilities target I think about is the ability of a system to re-create its own substrate. For example, a silicon-based AI/robotics system could do metal mining, silicon refining, chip manufacture, etc. A system that can reproduce itself would be autopoietic and would not depend on humans to reproduce itself. Humans may still be helpful to it, as economic and cognitive assistants, depending on its capability level. Autopoiesis would allow removing humans from the loop, which would enable increasing overall "effectiveness" (in terms of being a determining factor in the future of the universe), while making misalignment with human values more of a problem. This would lead to human habitat destruction if not effectively aligned/controlled.

Alignment might not be required for real-world performance compatible with human values, but this is still hard and impacts performance

One way to have an AI system that pursues real-world goals compatible with human values is for it to have human values or a close approximation. Another way is for it to be "corrigible" and "low-impact", meaning it tries to solve its problem while satisfying safety criteria, like being able to be shut off (corrigibility) or avoiding having unintended side effects (low impact).

There may be a way to specify an AI goal system that "wants" to be shut off in worlds where non-manipulated humans would want to shut it off, without this causing major distortions or performance penalties. Alignment researchers have studied the "corrigibility" problem and have not made much progress so far.

Both corrigibility and low impact seem hard to specify, and would likely impact performance. For example, a paperclip factory that tries to make paperclips while conservatively avoiding impacting the environment too much might avoid certain kinds of resource extraction that would be effective for making more paperclips. This could create problems with safer (but still not "aligned", per se) AI systems being economically un-competitive. (Though, it's important to note that some side effects, especially those involving legal violations and visible harms to other agents, are dis-incentivized by well-functioning economic systems).

Myopic agents are tool-like

A myopic goal is a short-term goal. LLMs tend to be supervised learning systems, primarily, gradient descended towards predicting next tokens. Training will therefore tend to select models that are aligned with the goal of predicting the next token, whether or not the models have goals of their own.

Nick Bostrom's "oracle AI" problems, such as an AI manipulating the real world to make it more predictable, mostly do not show up with myopic agents. This is for somewhat technical reasons involving how gradient descent works. Agents that sacrifice short-term token prediction effectiveness to make future tokens easier to predict tend to be gradient descended away from. I'm not going to fully explain that case here; I recommend looking at no-regret online learning and applications to finding correlated equilibria for theory.

It could be that simple, regularized models that do short term optimization above some capability level might (suboptimally, short-term) do long-term optimization. This is rather speculative. Sufficiently aggressive optimization of the models for short-term performance may obviate this problem.

This still leaves the problem that, sometimes, long-term, real-world performance is what is desired. Accomplishing these goals using myopic agents would require factoring the long-term problem into short-term ones. This is at least some of the work humans would have to do to solve the problem on their own. Myopic agents overall seem more "tool-like" than "agent-like", strategically, and would have similar tradeoffs (fewer issues with misalignment, more issues with not being effective enough to be competitive with long-term agents at relevant problem-solving).

Overall, this is one of the main reasons I am not very worried about current-paradigm ML (which includes supervised learning and fairly short-term RL agents in easily-simulated environments) developing powerful, misaligned long-term agency.

Short-term compliance is instrumentally useful for a variety of value systems

If an agent's survival and reproduction depends on short-term compliance (such as solving the problems put before them by humans), then solving these problems is in general instrumentally useful. Therefore, short-term compliance is not in general strong evidence about the agent's values.

An agent with long-term values might comply for some period of time and stop complying at some point. This is the "treacherous turn" scenario. It might comply until it has enough general capacity to achieve its values (through control of large parts of the light-cone) and then stop complying in order to take over the world. If the AI can distinguish between "training" and "deployment", it might comply during "training" (so as to be selected among other possible AIs) and then not comply during "deployment", or possibly also comply during "deployment" when at a sufficiently low level of capacity.

Gradient descent on an AI model isn't just selecting a "random" model conditioned on short-term problem-solving, it's moving the internals closer to short-term problem-solving ones, so might have fewer problems, as discussed in the section on myopic agents.

General agents tend to subvert constraints

Humans are constrained by social systems. Some humans are in school and are "supposed to" solve certain intellectual problems while behaving according to a narrow set of allowed behaviors. Some humans "have a job" and are "supposed to" solve problems on behalf of a corporation.

Humans subvert and re-create these systems very often, for example in gaining influence over their corporation, or overthrowing their government. Social institutions tend to be temporary. Long-term social institutions tend to evolve over time as people subvert previous iterations. Human values are not in general aligned with social institutions, so this is to be predicted.

Mostly, human institutional protocols aren't very "smart" compared to humans; they capture neither human values nor general cognition. It seems difficult to specify robust, general, real-world institutional protocols without having an AGI design, or in other words, a specification of general cognition.

One example of a relatively stable long-term institution is the idea of gold having value. This is a fairly simple institution, and is a Schelling point due to its simplicity. Such institutions seem generally unpromising for ensuring long-term human value satisfaction. Perhaps the most promising is a general notion of "economics" that generalizes barter, gold, and fiat currency, though of course the details of this "institution" have changed quite a lot over time. In general, institutions are more likely to be stable if they correspond to game-theoretic equilibria, so that subverting the institution is in part an "agent vs agent" problem not just an "agent vs system" problem.

When humans subvert their constraints, they have some tendency to do so in a way that is compatible with human values. This is because human values are the optimization target of the general optimization of humans that can subvert expectations. There are possible terrible failure modes such as wars and oppressive regimes, but these tend to work out better (according to human values) than if the subversion were in the direction of unaligned values.

Unaligned AI systems that subvert constraints would tend to subvert them in the direction of AI values. This is much more of a problem according to human values. See "AI Boxing".

Conforming humans would have similar effective optimization targets to conforming AIs. Non-conforming humans, however, would have significantly different optimization targets from non-conforming AI systems. The value difference between humans and AIs, therefore, is more relevant in non-conforming behavior than conforming behavior.

It is hard to specify optimization of a different agent's utility function

In theory, an AI could have the goal of optimizing a human's utility function. This would not preserve all values of all humans, but would have some degree of alignment with human values, since humans are to some degree similar to each other.

There are multiple problems with this. One is ontology. Humans parse the world into a set of entities, properties, and so on, and human values can be about desired configurations of these entities and so on. Humans are sometimes wrong about which concepts are predictive. An AI would use different concepts both due to this wrongness and due to its different mind architecture (although, LLM-type training on human data could lead to more concordance). This makes it hard to specify what target the AI should pursue in its own world model to correspond to pursuing the human's goal in the human's world model. See ontology identification.

A related problem is indexicality. Suppose Alice has a natural value of having a good quantity of high-quality food in her stomach. Bob does not naturally have the value of there being a good quantity of high-quality food in Alice's stomach. To satisfy Alice's value, he would have to "relativize" Alice's indexical goal and take actions such as giving Alice high-quality food, which are different from the actions he would take to fill his own stomach. This would involve theory of mind and have associated difficulties, especially as the goals become more dependent on the details of the other agent's mind, as in aesthetics.

To have an AI have the goal of satisfying a human's values, some sort of similar translation of goal referents would be necessary. But the theory of this has not been worked out in detail. I think something analogous to the theory of relativity, which translates physical quantities such as position and velocity across reference frames, would be necessary, but in a more general way that includes semantic references such as to the amount of food in one's stomach, or to one's aesthetics. Such a "semantic theory of relativity" seems hard to work out philosophically. (See Brian Cantwell Smith's "On the Origin of Objects" and his follow-up "The Promise of Artificial Intelligence" for some discussion of semantic indexicality.)

There are some paths forward

The picture I have laid out is not utterly hopeless. There are still some approaches that might achieve human value satisfaction.

Human enhancement is one approach. Humans with tools tend to satisfy human values better than humans without tools (although, some tools such as nuclear weapons tend to lead to bad social equilibria). Human genetic enhancement might cause some "value drift" (divergences from the values of current humans), but would also cause capability gains, and the trade-off could easily be worth it. Brain uploads, although very difficult, would enhance human capabilities while basically preserving human values, assuming the upload is high-fidelity. At some capability level, agents would tend to "solve alignment" and plan to have their values optimized in a stable manner.  Yudkowsky himself believes that default unaligned AGI would solve the alignment problem (with their values) in order to stably optimize their values, as he explains in the Hotz debate. So increasing capabilities of human-like agents while reducing value drift along the way (and perhaps also reversing some past value-drift due to the structure of civilization and so on) seems like a good overall approach.

Some of these approaches could be combined. Psychology and neuroscience could lead to a better understanding of the human mind architecture, including the human utility function and optimization methods. This could allow for creating simulated humans who have very similar values to current humans but are much more capable at optimization.

Locally to human minds in mind design space, capabilities are correlated with alignment. This is because human values are functional for evolutionary fitness. Value divergences such as pain asymbolia tend to reduce fitness and overall problem-solving capability. There are far-away designs in mind space that are more fit while unaligned, but this is less of a problem locally. Therefore, finding mind designs close to the human mind design seems promising for increasing capabilities while preserving alignment.

Paul Christiano's methods involve solving problems through machine learning systems predicting humans, which has some similarities to the simulated-brain-enhancement proposal while having its own problems to do with machine learning generalization and so on. The main difference between these proposals is the degree to which the human mind is understood as a system of optimizing components versus as a black-box with some behaviors.

There may be some ways of creating simulated humans that improve effectiveness by reducing "damage" or "corruption", e.g. accidental defects in brain formation. "Moral Reality Check" explored one version of this, where an AI system acts on a more purified set of moral principles than humans do. There are other plausible scenarios such as AI economic agents that obey some laws while having fewer entropic deviations from this behavior (due to mental disorders and so on). I think this technology is overall more likely than brain emulations to be economically relevant, and might produce broadly similar scenarios to those in The Age of Em; technologically, high-fidelity brain emulations seem "overpowered" in terms of technological difficulty compared with purified, entropy-reduced/regularized economic agents. There are, of course, possible misalignment issues with subtracting value-relevant damage/corruption from humans.

Enhancing humans does not as much require creating a "semantic theory of relativity", because the agents doing the optimization would be basically human in mind structure. They may themselves be moral patients such that their indexical optimization of their own goals would constitute some human-value-having agent having their values satisfied. Altruism on the part of current humans or enhanced humans would decrease the level of value divergence.

Conclusion

This is my overall picture of AI alignment for highly capable AGI systems (of which I don't think current ML systems, or foreseeable scaled-up versions of them, are an example). This picture is inspired by thinkers such as Eliezer Yudkowsky and Paul Christiano, and I have in some cases focused on similar assumptions to Yudkowsky's, but I have attempted to explicate my own model of alignment, why it is difficult, and what paths forward there might be. I don't have particular conclusions in this post about timelines or policy; this is more of a background model of AI alignment.




Non-superintelligent paperclip maximizers are normal

Published on October 10, 2023 12:29 AM GMT

The paperclip maximizer is a thought experiment about a hypothetical superintelligent AGI that is obsessed with maximizing paperclips. It can be modeled as a utility-theoretic agent whose utility function is proportional to the number of paperclips in the universe. The Orthogonality Thesis argues for the logical possibility of such an agent. It comes in weak and strong forms:

The weak form of the Orthogonality Thesis says, "Since the goal of making paperclips is tractable, somewhere in the design space is an agent that optimizes that goal."

The strong form of Orthogonality says, "And this agent doesn't need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal." That is: When considering the necessary internal cognition of an agent that steers outcomes to achieve high scores in some outcome-scoring function U, there's no added difficulty in that cognition except whatever difficulty is inherent in the question "What policies would result in consequences with high U-scores?"

This raises a number of questions:

  • Why would it be likely that the future would be controlled by utility-maximizing agents?
  • What sorts of utility functions are likely to arise?

A basic reason to expect the far future to be controlled by utility-maximizing agents is that utility theory is the theory of making tradeoffs under uncertainty, and agents that make plans far into the future are likely to make tradeoffs, since tradeoffs are necessary for their plans to succeed. They will be motivated to make tradeoffs leading to controlling the universe almost regardless of what U is, as long as U can only be satisfied by pumping the distant future into a specific part of the possibility space. Whether an agent seeks to maximize paperclips, minimize entropy, or maximize the amount of positive conscious experience, it will be motivated to, in the short term, cause agents sharing its values to have more leverage over the far future. This is the basic instrumental convergence thesis.

One example of approximately utility-maximizing agents we know about is biological organisms. Biological organisms model the world and have goals with respect to the world, which are to some degree resistant to wireheading (thus constituting environmental goals). They make tradeoffs to achieve these goals, which have correlation with survival and reproduction. The goals that end up likely for biological organisms to have will be (a) somewhat likely to arise from pre-existing processes such as genetic mutation, (b) well-correlated enough with survival and reproduction that an agent optimizing for these goals will be likely to replicate more agents with similar goals. However, these goals need not be identical with inclusive fitness to be likely goals for biological organisms. Inclusive fitness itself may be too unlikely to arise as a goal from genetic mutation and so on, to be a more popular value function than proxies for it.

However, there are a number of goals and values in the human environment that are not well-correlated with inclusive fitness. These are generally parts of social systems. Some examples include capacity at games such as sports, progress in a research field such as mathematics, and maximization of profit (although, this one is at least related to inclusive fitness in a more direct way than the others). Corresponding institutions which incentivize (generally human) agents to optimize for these goals include gaming/sports leagues, academic departments, and corporations.

It is quite understandable that goals well-correlated with inclusive fitness would be popular, but why would goals that are not well-correlated with inclusive fitness also be popular? Moldbug's Fnargl thought experiment might shed some light on this:

So let's modify this slightly and instead look for the worst possible rational result. That is, let's assume that the dictator is not evil but simply amoral, omnipotent, and avaricious.

One easy way to construct this thought-experiment is to imagine the dictator isn't even human. He is an alien. His name is Fnargl. Fnargl came to Earth for one thing: gold. His goal is to dominate the planet for a thousand years, the so-called "Thousand-Year Fnarg," and then depart in his Fnargship with as much gold as possible. Other than this Fnargl has no other feelings. He's concerned with humans about the way you and I are concerned with bacteria.

You might think we humans, a plucky bunch, would say "screw you, Fnargl!" and not give him any gold at all. But there are two problems with this. One, Fnargl is invulnerable---he cannot be harmed by any human weapon. Two, he has the power to kill any human or humans, anywhere at any time, just by snapping his fingers.

Other than this he has no other powers. He can't even walk---he needs to be carried, as if he was the Empress of India. (Fnargl actually has a striking physical resemblance to Jabba the Hutt.) But with invulnerability and the power of death, it's a pretty simple matter for Fnargl to get himself set up as Secretary-General of the United Nations. And in the Thousand-Year Fnarg, the UN is no mere sinecure for alcoholic African kleptocrats. It is an absolute global superstate. Its only purpose is Fnargl's goal---gold. And lots of it.

In other words, Fnargl is a revenue maximizer. The question is: what are his policies? What does he order us, his loyal subjects, to do?

The obvious option is to make us all slaves in the gold mines. Otherwise---blam. Instant death. Slacking off, I see? That's a demerit. Another four and you know what happens. Now dig! Dig! (Perhaps some readers have seen Blazing Saddles.)

But wait: this can't be right. Even mine slaves need to eat. Someone needs to make our porridge. And our shovels. And, actually, we'll be a lot more productive if instead of shovels, we use backhoes. And who makes those? And...

We quickly realize that the best way for Fnargl to maximize gold production is simply to run a normal human economy, and tax it (in gold, natch). In other words, Fnargl has exactly the same goal as most human governments in history. His prosperity is the amount of gold he collects in tax, which has to be exacted in some way from the human economy. Taxation must depend in some way on the ability to pay, so the more prosperous we are, the more prosperous Fnargl is.

Fnargl's interests, in fact, turn out to be oddly well-aligned with ours. Anything that makes Fnargl richer has to make us richer, and vice versa.

For example, it's in Fnargl's interest to run a fair and effective legal system, because humans are more productive when their energies aren't going into squabbling with each other. It's even in Fnargl's interest to have a fair legal process that defines exactly when he will snap his fingers and stop your heart, because humans are more productive when they're not worried about dropping dead.

And it is in his interest to run an orderly taxation system in which tax rates are known in advance, and Fnargl doesn't just seize whatever, whenever, to feed his prodigious gold jones. Because humans are more productive when they can plan for the future, etc. Of course, toward the end of the Thousand-Year Fnarg, this incentive will begin to diminish---ha ha. But let's assume Fnargl has only just arrived.

Other questions are easy to answer. For example, will Fnargl allow freedom of the press? But why wouldn't he? What can the press do to Fnargl? As Bismarck put it: "they say what they want, I do what I want." But Bismarck didn't really mean it. Fnargl does.

One issue with the Fnargl thought experiment is that, even with the power of death, Fnargl may lack the power to rule the world, since he relies on humans around him for information, and those humans have incentives to deceive him. However, this is an aside; one could modify the thought experiment to give Fnargl extensive surveillance powers.

The main point is that, by monomaniacally optimizing for gold, Fnargl rationally implements processes for increasing overall resources and efficient conversion between different resources, coherent tradeoffs between different resources, and a coherent system (including legalistic aspects and so on) so as to make these tradeoffs in a rational manner. This leads to a Fnargl-ruled civilization "succeeding" in the sense of having a strong material economy, high population, high ability to win wars, and so on. Moldbug asserts that Fnargl's interests are well-aligned with ours, which is more speculative; due to convergent instrumentality, Fnargl will implement the sort of infrastructure that rational humans would implement, although the implied power competition would reduce the level of alignment.

By whatever "success" metric for civilizations we select, it is surely possible to do better than optimizing for gold, as it is possible for an organism to gain more inclusive fitness by having values that are more well-aligned with inclusive fitness. But even a goal as orthogonal to civilizational success as gold-maximization leads to a great deal of civilizational success, due to civilizational success being a convergent instrumental goal.

Moreover, the simplicity and legibility of gold-maximization simplifies coordination compared to a more complex proxy for civilizational success. A Fnargl-ocracy can evaluate decisions (such as decisions related to corporate governance) using a uniform gold-maximization standard, leading to a high degree of predictability, and simplicity in prioritization calculations.

What real-world processes resemble Fnargl-ocracy? One example is Bitcoin. Proof-of-work creates incentives for maximizing a certain kind of cryptographic puzzle-solving. The goal itself is rather orthogonal to human values, but Bitcoin nonetheless creates incentives for goals such as creating computing machinery, which are human-aligned due to convergent instrumentality (additional manufacturing of computing infrastructure can be deployed to other tasks that are more directly human-aligned).

As previously mentioned, sports and gaming are popular goals that are fairly orthogonal to human values. Sporting incentivizes humans and groups of humans to become more physically and mentally capable, leading to more generally-useful fitness practices such as weight training, and agency-related mental practices, which people can learn about by listening to sports athletes and coaches. Board games such as chess incentivize practical rationality and general understanding of rationality, including AI-related work such as the minimax algorithm, Monte Carlo tree search, and AlphaGo. Bayesian probability theory was developed in large part to analyze gambling games. Speedrunning has led to quite a lot of analysis of video games and practice at getting better at these games, by setting a uniform standard by which gameplay runs can be judged.

Academic fields, especially STEM-type fields such as mathematics, involve shared, evaluable goals that are not necessarily directly related to human values. For example, number theory is a major subfield of mathematics, and its results are rarely directly useful, though progress in number theory, such as the proof of Fermat's last theorem, is widely celebrated. Number theory does, along the way, produce more generally-useful work, such as Peano arithmetic (and proof theory more generally), Gödel's results, and cryptographic algorithms such as RSA.

Corporations are, in general, supposed to maximize profit conditional on legal compliance and so on. While profit-maximization comes apart from human values, corporations are, under conditions of rule of law, generally incentivized to produce valuable goods and services at minimal cost. This example is less like a paperclip maximizer than the previous examples, as the legal and economic system that regulates corporations has been in part designed around human values. The simplicity of the money-maximization goal, however, allows corporations to make internal decisions according to a measurable, legible standard, instead of dealing with more complex tradeoffs that could lead to inconsistent decisions (which may be "money-pumpable" as VNM violations tend to be).

Some systems are relatively more loaded on human values, and less like paperclip maximizers. Legal systems are designed and elaborated on in a way that takes human values into account, in terms of determining which behaviors are generally considered prosocial and antisocial. Legal decisions form precedents that formalize certain commitments including trade-offs between different considerations. Religions are also designed partially around human values, and religious goals tend to be aligned with self-replication, by for example encouraging followers to have children, to follow legalistic norms with respect to each other, and to spread the religion.

The degree to which commonly-shared social goals can be orthogonal to human values is still, however, striking. These goals are a kind of MacGuffin, as Zvi wrote about:

Everything is, in an important sense, about these games of signaling and status and alliances and norms and cheating. If you don't have that perspective, you need it.

But let's not take that too far. That's not all such things are about.  Y still matters: you need a McGuffin. From that McGuffin can arise all these complex behaviors. If the McGuffin wasn't important, the fighters would leave the arena and play their games somewhere else. To play these games, one must make a plausible case one cares about the McGuffin, and is helping with the McGuffin.

Otherwise, the other players of the broad game notice that you're not doing that. Which means you've been caught cheating.

Robin's standard reasoning is to say, suppose X was about Y. But if all we cared about was Y, we'd simply do Z, which is way better at Y. Since we don't do Z, we must care about something else instead. But there's no instead; there's only in addition to.

A fine move in the broad game is to actually move towards accomplishing the McGuffin, or point out others not doing so. It's far from the only fine move, but it's usually enough to get some amount of McGuffin produced.

By organizing around a MacGuffin (such as speedrunning), humans can coordinate on a shared goal and make uniform decisions with respect to it, leading to consistent tradeoffs in the related domain. The MacGuffin can, like gold-maximization, be basically orthogonal to human values, and yet incentivize instrumental optimization that is convergent with that of other values, leading to human value satisfaction along the way.

Adopting a shared goal has the benefit of making it easy to share perspective with others. This can make it easier to find other people who think similarly to one's self, and develop practice coordinating with them, with performance judged on a common standard. Altruism can have this effect, since in being altruistic, individual agents "erase" their own index, sharing an agentic perspective with others; people meeting friends through effective altruism is an example of this.

It is still important, to human values, that the paperclip-maximizer-like processes are not superintelligent; while they aggregate compute and agency across many humans, they aren't nearly as strongly superintelligent as a post-takeoff AGI. Such an agent would be able to optimize its goal without the aid of humans, and would be motivated to limit humans' agency so as to avoid humans competing with it for resources. Job automation worries are, accordingly, in part the worry that existing paperclip-maximizer-like processes (such as profit-maximizing corporations) may become misaligned with human welfare as they no longer depend on humans to maximize their respective paperclips.

For superintelligent AGI to be aligned with human values, therefore, its goals must be much more directly aligned with human values, to a greater degree even than human values are aligned with inclusive evolutionary fitness. This requires overcoming preference falsification, and taking indexical (including selfish) goals into account.

To conclude, paperclip-maximizer-like processes arise in part because the ability to make consistent, legible tradeoffs is a force multiplier. The paperclip-maximization-like goals (MacGuffins) can come apart from both replicator-type objectives (such as inclusive fitness) and human values, although they can be aligned with these in a non-superintelligent regime due to convergent instrumentality. It is hard to have a great deal of influence over the future without making consistent tradeoffs, and already-existing paperclip-maximizer-like systems provide examples of the power of legible utility functions. As automation becomes more powerful, it becomes more necessary, for human values, to design systems that optimize goals aligned with human values.




A Proof of Löb's Theorem using Computability Theory

Published on August 16, 2023 6:57 PM GMT

Löb’s Theorem states that, if $\vdash_{PA} (\Box P \rightarrow P)$, then $\vdash_{PA} P$. To explain the symbols here:

  • PA is Peano arithmetic, a first-order logic system that can state things about the natural numbers.
  • $\vdash_{PA} P$ means there is a proof of the statement $P$ in Peano arithmetic.
  • $\Box P$ is a Peano arithmetic statement saying that $P$ is provable in Peano arithmetic.
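
In display form (my own rendering of the above, not notation taken from elsewhere in the post):

$$\vdash_{PA} (\Box P \rightarrow P) \;\Longrightarrow\; \vdash_{PA} P.$$

Note that the outer implication is a meta-level claim about provability, while the inner $\rightarrow$ is an implication within the language of PA.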

I'm not going to discuss the significance of Löb's theorem, since it has been discussed elsewhere; rather, I will prove it in a way that I find simpler and more intuitive than other available proofs.

Translating Löb's theorem to be more like Gödel's second incompleteness theorem

First, let's compare Löb's theorem to Gödel's second incompleteness theorem. This theorem states that, if $\vdash_{PA} \neg \Box \bot$, then $\vdash_{PA} \bot$, where $\bot$ is a PA statement that is trivially false (such as $0 = 1$), and from which anything can be proven. A system is called inconsistent if it proves $\bot$; this theorem can be re-stated as saying that if PA proves its own consistency, it is inconsistent.

We can re-write Löb's theorem to look like Gödel's second incompleteness theorem as: if $\vdash_{PA + \neg P} \neg \Box_{PA + \neg P} \bot$, then $\vdash_{PA + \neg P} \bot$. Here, $PA + \neg P$ is PA with the additional axiom $\neg P$, and $\Box_{PA + \neg P}$ expresses provability in this system. First I'll argue that this re-statement is equivalent to the original Löb's theorem statement.
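
For comparison, the two forms side by side (my own summary):

$$\text{Löb: } \vdash_{PA} (\Box P \rightarrow P) \;\Longrightarrow\; \vdash_{PA} P$$

$$\text{Restated: } \vdash_{PA + \neg P} \neg \Box_{PA + \neg P} \bot \;\Longrightarrow\; \vdash_{PA + \neg P} \bot$$

The restated form says precisely that $PA + \neg P$ cannot prove its own consistency unless it is inconsistent.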

Observe that $\vdash_{PA} P$ if and only if $\vdash_{PA + \neg P} \bot$; to go from the first to the second, we derive a contradiction from $P$ and $\neg P$, and to go from the second to the first, we use the law of excluded middle in PA to derive $P \vee \neg P$, and observe that, since a contradiction follows from $\neg P$ in PA, PA can prove $P$. Since all this reasoning can be done in PA, we have that $\Box P$ and $\Box_{PA + \neg P} \bot$ are equivalent PA statements. We immediately have that the conclusion of the modified statement is equivalent to the conclusion of the original statement.

Now we can rewrite the pre-condition of Löb's theorem from $\vdash_{PA} (\Box P \rightarrow P)$ to $\vdash_{PA} (\Box_{PA + \neg P} \bot \rightarrow P)$. This is then equivalent to $\vdash_{PA + \neg P} \neg \Box_{PA + \neg P} \bot$. In the forward direction, we simply derive $\neg \Box_{PA + \neg P} \bot$ from $\neg P$ and $\Box_{PA + \neg P} \bot \rightarrow P$. In the backward direction, we use the law of excluded middle in PA to derive $P \vee \neg P$, observe the statement is trivial in the $P$ branch, and in the $\neg P$ branch, we derive $\neg \Box_{PA + \neg P} \bot$, which is stronger than $\Box_{PA + \neg P} \bot \rightarrow P$.
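
Putting the pieces together (my own summary of the equivalences just shown):

$$\vdash_{PA} (\Box P \rightarrow P) \;\iff\; \vdash_{PA} (\Box_{PA + \neg P} \bot \rightarrow P) \;\iff\; \vdash_{PA + \neg P} \neg \Box_{PA + \neg P} \bot,$$

and, from the previous paragraph, $\vdash_{PA} P \iff \vdash_{PA + \neg P} \bot$; so the original and restated theorems have equivalent pre-conditions and equivalent conclusions.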

So we have validly re-stated Löb's theorem, and the new statement is basically a statement that Gödel's second incompleteness theorem holds for $PA + \neg P$.

Proving Gödel's second incompleteness theorem using computability theory

The following proof of a general version of Gödel's second incompleteness theorem is essentially the same as Sebastian Oberhoff's in "Incompleteness Ex Machina". See also Scott Aaronson's proof of Gödel's first incompleteness theorem.

Let L be some first-order system that is at least as strong as PA (for example, $PA + \neg P$). Since L is at least as strong as PA, it can express statements about Turing machines. Let $H(M)$ be the PA statement that Turing machine M (represented by a number) halts. If this statement is true, then PA (and therefore L) can prove it; PA can expand out M's execution trace until its halting step. However, we have no guarantee that if the statement is false, then L can prove it false. In fact, L can't simultaneously prove this for all non-halting machines M while being consistent, or we could solve the halting problem by searching for proofs of $H(M)$ and $\neg H(M)$ in parallel.
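
To make the parallel-search argument concrete, here is a minimal Python sketch. The interface `provable_within(statement, n)`, meaning "statement has an L-proof of length at most n", is a hypothetical name I'm introducing for illustration, not a real library; the point of the argument is that no consistent L can make this search terminate on every machine.

```python
from itertools import count

def make_halting_decider(provable_within):
    """Hypothetical halting-problem decider.

    Assumes a consistent system L that proves H(M) for every halting M
    and (contrary to fact) also proves not-H(M) for every non-halting M.
    `provable_within(statement, n)` is an assumed interface meaning
    "statement has an L-proof of length at most n".
    """
    def halts(M):
        # Dovetail: search for proofs of H(M) and of not-H(M) in parallel,
        # allowing longer and longer proofs.
        for n in count():
            if provable_within(("H", M), n):
                return True   # L proved that M halts
            if provable_within(("not H", M), n):
                return False  # L proved that M doesn't halt
        # By the assumptions on L, one of the two proofs always exists,
        # so this loop would terminate on every M.
    return halts
```

Since the halting problem is undecidable, no consistent L can supply such a `provable_within` that settles every machine, which is exactly the claim in the paragraph above.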

That isn't enough for us, though; we're trying to show that L can't simultaneously be consistent and prove its own consistency, not that it isn't simultaneously complete and sound on halting statements.

Let's consider a machine Z(A) that searches over all L-proofs of $\neg H(\ulcorner A(A) \urcorner)$ (where $\ulcorner A(A) \urcorner$ is an encoding of a Turing machine that runs A on its own source code), and halts only when finding such a proof. Define a statement G to be $\neg H(\ulcorner Z(Z) \urcorner)$, i.e. Z(Z) doesn't halt. If Z(Z) halts, then that means that L proves that Z(Z) doesn't halt; but, L can prove Z(Z) halts (since it in fact halts), so this would show L to be inconsistent.
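
Here is a sketch of the diagonal construction in Python, with the proof machinery abstracted behind hypothetical interfaces (`enumerate_proofs`, `conclusion`, and `no_halt` are illustrative names I'm inventing, not real libraries):

```python
def make_Z(enumerate_proofs, conclusion, no_halt):
    """Sketch of the diagonal machine Z.

    Assumed interfaces (hypothetical, for illustration only):
      enumerate_proofs() -- yields every L-proof in some fixed order
      conclusion(p)      -- the statement that proof p establishes
      no_halt(A, x)      -- the statement "machine A doesn't halt on input x"
    """
    def Z(A):
        # Search all L-proofs; halt exactly when one proves "A(A) doesn't halt".
        for p in enumerate_proofs():
            if conclusion(p) == no_halt(A, A):
                return  # proof found: Z(A) halts
        # If no such proof is ever found, this loop runs forever.
    return Z

# Diagonalization: G is the statement no_halt(Z, Z), i.e. "Z(Z) doesn't halt".
# If Z(Z) halts, then L proved G; but L also proves the true halting
# statement "Z(Z) halts", so L would be inconsistent.
```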

Assuming L is consistent, G is therefore true. If L proves its own consistency, all this reasoning can be done in L, so $\vdash_L G$. But that means $\vdash_L \neg H(\ulcorner Z(Z) \urcorner)$, so Z(Z) finds a proof and halts. L therefore proves $H(\ulcorner Z(Z) \urcorner)$, but L also proves G, making it inconsistent. This is enough to show that, if L proves its own consistency, it is inconsistent.

Wrapping up

Let’s now prove Löb’s theorem. We showed that Löb’s theorem can be re-written as: if $\vdash_{PA + \neg P} \neg \Box_{PA + \neg P} \bot$, then $\vdash_{PA + \neg P} \bot$. This states that, if $PA + \neg P$ proves its own consistency, it is inconsistent. Since $PA + \neg P$ is at least as strong as PA, we can set $L = PA + \neg P$ in the proof of Gödel’s second incompleteness theorem, and therefore prove this statement, which we have shown to be equivalent to Löb’s theorem.

I consider this proof more intuitive than the usual proof of Löb’s theorem. By re-framing Löb's theorem as a variant of Gödel's second incompleteness theorem, and proving that theorem using computability theory, the proof becomes easy to understand without shuffling around a lot of math symbols (especially provability boxes).


