Normal view

There are new articles available, click to refresh the page.
Before yesterdayAI

[Aspiration-based designs] Outlook: dealing with complexity

Published on April 28, 2024 1:06 PM GMT

Summary. This teaser post sketches our current ideas for dealing with more complex environments. It will ultimately be replaced by one or more longer posts describing these in more detail. Reach out if you would like to collaborate on these issues.

Multi-dimensional aspirations

For real-world tasks that are specified in terms of more than a single evaluation metric, e.g., how much apples to buy and how much money to spend at most, we can generalize Algorithm 2 as follows from aspiration intervals to convex aspiration sets:

  • Assume there are  many evaluation metrics , combined into a vector-valued evaluation metric 
  • Preparation: Pick  many linear combinations  in the space spanned by these metrics so that their convex hull is full-dimensional and contains the origin, and consider the  many policies  each of which maximizes the expected value of the corresponding function . Let  and  be the expected values of  when using  in state  or after using action  in state , respectively (see Fig. 1). Let the admissibility simplices  and  be the simplices spanned by the vertices  and , respectively (red and violet triangles in Fig. 1). They replace the feasibility intervals used in Algorithm 2. 
  • Policy: Given a convex state-aspiration set  (central green polyhedron in Fig. 1), compute its midpoint (centre of mass)  and consider the  segments  from  to the corners  of  (dashed black lines in Fig. 1). For each of these segments , let  be the (nonempty!) set of actions for which  intersects . For each , compute the action-aspiration  by shifting a copy  of  along  towards  until the intersection of  and  is contained in the intersection of  and  (half-transparent green polyhedra in Fig. 1), and then intersecting  with  to give  (yellow polyhedra in Fig. 1). Then pick one candidate action from each  and randomize between these  actions in proportions so that the corresponding convex combination of the sets  is included in . Note that this is always possible because  is in the convex hull of the sets  and the shapes of the sets  "fit" into  by construction.
  • Aspiration propagation: After observing the successor state , the action-aspiration  is rescaled linearly from  to  to give the next state-aspiration , see Fig. 2. 

(We also consider other variants of this general idea) 

Fig. 1: Admissibility simplices, and construction of action-aspirations by shifting towards corners and intersecting with action admissibility simplices (see text for details).
Fig. 2: An action admissibility simplex  is the convex combination of the successor states' admissibility simplices , mixed in proportion to the respective transition probabilities . An action aspiration  can be rescaled to a successor state aspiration  by first mapping the corners of the action admissibility sets onto each other (dashed lines) and extending this map linearly.

Hierarchical decision making

A common way of planning complex tasks is to decompose them into a hierarchy of two or more levels of subtasks. Similar to existing approaches from hierarchical reinforcement learning, we imagine that an AI system can make such hierarchical decisions as depicted in the following diagram (shown for only two hierarchical levels, but obviously generalizable to more levels):

Fig. 3: Hierarchical world model in the case of two hierarchical levels of decision making.


Discuss

[Aspiration-based designs] 3. Performance and safety criteria, and aspiration intervals

Published on April 28, 2024 1:04 PM GMT

Summary. In this post, we extend the basic algorithm by adding criteria for choosing the two candidate actions the algorithm mixes, and by generalizing the goal from making the expected Total equal a particular value to making it fall into a particular interval. We only use simple illustrative examples of performance and safety criteria and reserve the discussion of more useful criteria for later posts.  

Introduction: using the gained freedom to increase safety

After having introduced the basic structure of our decision algorithms in the last post, in this post we will focus on the core question: How shall we make use of the freedom gained from having aspiration-type goals rather than maximization goals? 

After all, while there is typically only a single policy that maximize some objective function (or very few, more or less equivalent policies), there is typically a much larger set of policies that fulfill some constraints (such as the aspiration to make the expected Total equal some desired value). 

More formally: Let us think of the space of all (probabilistic) policies, , as a compact convex subset of a high-dimensional vector space with dimension  and Lebesgue measure . Let us call a policy  successful iff it fulfills the specified goal, , and let  be the set of successful policies. Then this set has typically zero measure, , and low dimension, , if the goal is a maximization goals, but it has large dimension, , for most aspiration-type goals. 

E.g., if the goal is to make expected Total equal an aspiration value, , we typically have  but still . At the end of this post, we discuss how the set of successful policies can be further enlarged by switching from aspiration values to aspiration intervals to encode goals, which makes the set have full dimension, , and positive measure, .

What does that mean? It means we have a lot of freedom to choose the actual policy  that the agent should use to fulfill an aspiration-type goal. We can try to use this freedom to choose policies that promise to be rather safe than unsafe according to some generic safety metric, similar to the impact metrics used in reward function regularization for maximizers.

Depending on the type of goal, we might also want to use this freedom to choose policies that fulfill the goal in a rather desirable than undesirable way according to some goal-related performance metric

In this post, we will illustrate this with only very few, "toy" safety metrics, and one rather simple goal-related performance metric, to exemplify how such metrics might be used in our framework. In a later post, we will then discuss more sophisticated and hopefully more useful safety metrics. 

Let us begin with a simple goal-related performance metric since that is the most straightforward.

Simple example of a goal-related performance metric 

Recall that in step 2 of the basic algorithm, we could make the agent pick any action  whose action-aspiration is at most as large as the current state-aspiration, , and it can also pick any other action, , whose action-aspiration is at least as large as the current state-aspiration, . This flexibility is because in steps 3 and 4 of the algorithm, the agent is still able to randomize between these two actions  in a way that makes expected Total, , become exactly .

If one had an optimization mindset, one might immediately get the idea to not only match the desired expectation for the Total, but also to minimize the variability of the Total, as measured by some suitable statistic such as its variance. In a sequential decision making situation like an MDP, estimating the variance of the Total requires a recursive calculation that anticipates future actions, which can be done but is not trivial. We reserve this for a later post.

Let us instead look at a simpler metric to illustrate the basic approach: the (relative, one-step) squared deviation of aspiration, which is very easy to compute:

The rationale for this is that at any time, the action aspirations  are the expected values of Total-to-go conditional on taking action , so keeping them close to  will tend to lead to a smaller variance. Indeed, that variance is lower bounded by , where  is the probability for choosing  calculated in step 3 of Algorithm 1.  

If the action space is rich enough, there will often be at least one action the action-aspiration of which equals the state-aspiration, , because the state-aspiration is contained in that action's feasibility interval, . There might even be a large number of such actions. This will in particular be the case in the early steps of an episode. This is because often one can distribute the amount of effort spent on a task more or less flexibly over time; as long as there is enough time left, one might start by exerting little effort and then make up for this in later steps, or begin with a lot of effort and then relax later. 

If this is the case, then minimizing  simply means choosing one of the actions for which  and thus  and thus , and then put . When there are many such candidate actions , that will still leave us with some remaining freedom for incorporating some other safety criteria to choose between them, maybe deterministically. One might thus think that this form of optimization is safe enough and should be performed because it gets rid of all variability and randomization in that step. 

For example: Assume the goal is to get back home with fourteen apples within a week, and the agent can harvest or eat at most six apples on each day (apples are really scarce). Then each day the agent might choose to harvest or eat any number of apples, as long as its current stock of apples does not deviate from fourteen by more than six times the remaining number of days. Only in the last several days it might then have to harvest or eat exactly six apples per day, to make up for earlier deviations and to land at fourteen apples sharp eventually.   

But there's at least two objections to minimizing SDA. First, generically, we can not be sure enough that there will indeed be many different actions for which , and so restricting our choice to the potentially only few actions that fulfill this might not allow us to incorporate safety criteria to a sufficient level. In particular, we should expect that the latter is often the case in the final steps of an episode, where there might at best be a single such action that perfectly makes up for the earlier under- or over-achievements, or even no such action at all. Second, getting rid of all randomization goes against the intuition of some many members of our project team that randomization is a desirable feature that tends to increase rather than decrease safety (this intuition also underlies the alternative approach of quantilization). 

We think that one should thus not minimize performance metrics such as SDA or any other of the later discussed metrics, but at best use them as soft criteria. Arguably the most standard way to do this is to use a softmin (Boltzmann) policy for drawing both candidate actions  and  independently of each other, on the basis of their  (or another metric's) scores: 

restricted to those  with , and for some sufficiently small inverse temperature  to ensure sufficient randomization. 

While the SDA criterion is about keeping the variance of the Total rather low and is thus about fulfilling the goal in a presumably more desirable way from the viewpoint of the goal-setting client, the more important type of action selection criterion is not goal- but safety-related. So let us now look at the latter type of criterion. 

Toy example of a generic safety metric

As the apple example above shows, the agent might easily get into a situation where it seems to be a good idea to take an extreme action. A very coarse but straightforward and cheap to compute heuristic for detecting at least some forms of extreme actions is when the corresponding aspiration is close to an extreme value of the feasibility interval:  or 

This motivates the following generic safety criterion, the (relative, one-step) squared extremity of aspiration:

(The factor 4 is for normalization only) 

The rationale is that using actions  with rather low values of SEA would tend to make the agent's aspirations stay close to the center of the relevant feasibility intervals, and thus hopefully also make its actions remain moderate in terms of the achieved Total. 

Let's see whether that hope is justified by studying what minimizing SEA would lead to in the above apple example:

  • On day zero, , and  for . Since the current state-aspiration, 14, is in all actions' feasibility intervals, all action-aspirations are also 14. Both the under- and overachieving action with smallest SEA is thus , since this action's Q-midpoint, 6, is closest to its action-aspiration, 14. The corresponding 
  • On the next day, the new feasibility interval is , so the new state-aspiration is . This is simply the difference between the previous state-aspiration, 14, and the received Delta, 6. (It is easy to see that the aspiration propagation mechanism used has this property whenever the environment is deterministic). Since , we again have  for all , and thus again  since 6 is closest to 8. 
  • On the next day, ,  , and now  since 2 is closest to 2. Afterwards, the agent neither harvests not eats any apples but lies in the grass, relaxing.

The example shows that the consequence of minimizing SEA is not, as hoped, always an avoidance of extreme actions. Rather, the agent chooses the maximum action in the first two steps (which might or might not be what we eventually consider safe, but is certainly not what we hoped the criterion would prevent). 

Maybe we should compare Deltas rather than aspirations if we want to avoid this? So what about (relative, one-step) squared extremity of Delta,  

If the agent minimizes this instead of SEA in the apple example, it will procrastinate by choosing  for the first several days, as long as . This happens on the first four days. On the fifth day (), it will still have aspiration 14. It will then again put  with action-aspiration, but it will put  with action-aspiration . Since the latter equals the state-aspiration, the calculated probability in step 3 of the algorithm turns out to be , meaning the agent will use action  for sure after all, rather than randomizing it with . This leaves an aspiration of . On the remaining two days, the agent then has to use  to fulfill the aspiration in expectation.

Huh? Even though both heuristics seem to be based on a similar idea, one of them leads to early extreme action and later relaxation, and the other to procrastination and later extreme action, just the opposite. Neither of them avoids extreme actions. 

One might think that one could fix the problem by simply adding the two metrics up into a combined safety loss, 

But in the apple example, the agent would still harvest six apples on the first day, since . Only on the second day, it would harvest just four instead of six apples, because  is minimal. On the third day: two apples because  is minimal. Then one apple, and then randomize between one or zero apples, until it has the 14 apples, or until the last day has come where it needs to harvest one last apple.

The main reason for all these heuristics failing in that example is their one-step nature which does not anticipate later actions. In a later post we will thus study more complex, farsighted metrics that can still be computed efficiently in a recursive manner, such as disordering potential,

which measures the largest Shannon entropy in the state trajectory that the agent could cause if it aimed to, or terminal state distance,

which measures the expected distance between the terminal and initial state according to some metric  on state space and depends on the actual policy  and aspiration propagation scheme  that the agent uses. 

But before looking at such metrics more closely, let us finish this post by discussing other ways to combine several performance and/or safety metrics. 

Using several safety and performance metrics

Defining a combined loss function like  above by adding up components is clearly not the only way to combine several safety and/or performance metrics, and is maybe not the best way to do that because it allows unlimited trade-offs between the components. 

It also requires them to be measured in the same units and to be of similar scale, or to make them so by multiplying them with suitable prefactors that would have to be justified. This at least can always be achieved by some form of normalization, like we did above to make SEA and SED dimensionless and bounded by .

Trade-offs between several normalized loss components  can be limited in all kinds of ways, for example by using some form of economics-inspired "safety production function" such as the CES-type function 

with suitable parameters  and  with . For , we just have a linear combination that allows for unlimited trade-offs. At the other extreme, in the limit of , we get , which does not allow for any trade-offs.

Such a combined loss function can then be used to determine the two actions , e.g., using a softmin policy as suggested above. 

An alternative to this approach is to use some kind of filtering approach to prevent unwanted trade-offs. E.g., one could use one of the safety loss metrics, , to restrict the set of candidate actions to those with sufficiently small loss, say with, then use a second metric, , to filter further, and so on, and finally use the last metric, , in a softmin policy.

As we will see in a later post, one can come up with many plausible generic safety criteria, and one might be tempted to just combine them all in one of these ways, in the hope to thereby have found the ultimate safety loss function or filtering scheme. But it might also well be that one will have forgotten some safety aspects and end up with an imperfect combined safety metric. This would be just another example of a misspecified objective function. Hence we should again resist falling into the optimization trap of strictly minimizing that combined loss function, or the final loss function of the filtering scheme. Rather, the agent should probably still use a sufficient amount of randomization in the end, for example by using a softmin policy with sufficient temperature. 

If it is unclear what a sufficient temperature is, one can set the temperature automatically so that the odds ratio between the least and most likely actions equals a specified value:  for some fixed .

More freedom via aspiration intervals

We saw earlier that if the goal is to make expected Total equal an aspiration value, , the set of successful policies has large but not full dimension and thus still has measure zero. In other words, making expected Total exactly equal to some value still requires policies that are very "precise" and in this regard very "special" and potentially dangerous. So we should probably allow some leeway, which should not make any difference in almost all real-world tasks but increase safety further by avoiding solutions that are too special. 

Of course the simplest way to provide this leeway would be to just get rid of the hard constraint altogether and replace it by a soft incentive to make  close to , for example by using a softmin policy based on the mean squared error. This might be acceptable for some tasks but less so for others. The situation appears to be somewhat similar to the question of choosing estimators in statistics (e.g., a suitable estimator of variance), where sometimes one only wants the estimator with the smallest standard error, not caring for bias (and thus not having any guarantee about the expected value), sometimes one wants an unbiased estimator (i.e., an estimator that comes with an exact guarantee about the expected value), and sometimes one wants at least a consistent estimator that is unbiased in the limit of large data and only approximately unbiased otherwise (i.e., have only an asymptotic rather than an exact guarantee about the expected value). 

For tasks where one wants at least some guarantee about the expected Total, one can replace the aspiration value by an aspiration interval  and require that 

The basic algorithm (algorithm 1) can easily be generalized to this case and only becomes a little bit more complicated due to the involved interval arithmetic:

Decision algorithm 2    

Similar to algorithm 1, we... 

  • set action-aspiration intervals  for each possible action  on the basis of the current state-aspiration interval  and the action's feasibility interval , trying to keep  similar to and no wider than ,
  • choose an "under-achieving" action  and an "over-achieving" action  w.r.t. the midpoints of the intervals  and ,
  • choose probabilistically between  and  with suitable probabilities, and
  • propagate the action-aspiration interval  to the new state-aspiration interval  by rescaling between the feasibility intervals of  and .

More precisely: Let  denote the midpoint of interval . Given state  and state-aspiration interval ,

  1. For all , let  be the closest interval to  that lies within  and is as wide as the smaller of  and .
  2. Pick some  with 
  3. Compute 
  4. With probability , put , otherwise put 
  5. Implement action  in the environment and observe successor state 
  6. Compute  and , and similarly for  and 

Note that the condition that no  must be wider than  in step 1 ensures that, independently of what the value of  computed in step 3 will turn out to be, any convex combination  of values  is an element of . This is the crucial feature that ensures that aspirations will be met:

Theorem (Interval guarantee)

If the world model predicts state transition probabilities correctly and the episode-aspiration interval  is a subset of the initial state's feasibility interval, , then algorithm 2 fulfills the goal .

The Proof is completely analogous to the proof in the last post, except that each occurrence of  and  is replaced by  and , linear combination of sets is defined as , and we use the fact that  implies .

Special cases

Satisficing. The special case where the upper end of the aspiration interval coincides with the upper end of the feasibility interval leads to a form of satisficing guided by additional criteria.

Probability of ending in a desirable state. A subcase of satisficing is when all Deltas are 0 except for some terminal states where it is 1, indicating that a "desirable" terminal state has been reached. In that case, the lower bound of the aspiration interval is simply the minimum acceptable probability of ending in a desirable state.

Before discussing some more interesting safety metrics, we will first introduce in the next post a few simple environments to test these criteria in...



Discuss

[Aspiration-based designs] 2. Formal framework, basic algorithm

Published on April 28, 2024 1:02 PM GMT

Summary. In this post, we present the formal framework we adopt during the sequence, and the simplest form of the type of aspiration-based algorithms we study. We do this for a simple form of aspiration-type goals: making the expectation of some variable equal to some given target value. The algorithm is based on the idea of propagating aspirations along time, and we prove that the algorithm gives a performance guarantee if the goal is feasible. Later posts discuss safety criteria, other types of goals, and variants of the basic algorithm.

Assumptions

In line with the working hypotheses stated in the previous post, we assume more specifically the following in this post:

  • The agent is a general-purpose AI system that is given a potentially long sequence of tasks, one by one, which it does not know in advance. Most aspects of what we discuss focus on the current task only, but some aspects relate to the fact that there will be further, unknown tasks later (e.g., the question of how much power the agent shall aim to retain at the end of the task).
  • It possesses an overall world model that represents a good enough general understanding of how the world works.
  • Whenever the agent is given a task, an episode begins and its overall world model provides it with a (potentially much simpler) task-specific world model that represents everything that is relevant for the time period until the agent gets a different task or is deactivated, and that can be used to predict the potentially stochastic consequences of taking certain actions in certain world states.
  • That task-specific world model has the form of a (fully observed) Markov Decision Process (MDP) that however does not contain a reward function  but instead contains what we call an evaluation function related to the task (see 2nd to next bullet point).
  • As a consequence of a state transition, i.e., of taking a certain action  in a certain state  and finding itself in a certain successor state , a certain task-relevant evaluation metric changes by some amount. Importantly, we do not assume that the evaluation metric inherently encodes things of which more is better. E.g., the evaluation metric could be global mean temperature, client's body mass, x coordinate of the agent's right thumb, etc. We call the step-wise change in the evaluation metric the received Delta in that time step, denoted . We call its cumulative sum over all time steps of the episode the Total, denoted . Formally, Delta and Total play a similar role for our aspiration-based approach as the concepts of "reward" and "return" play for maximization-based approaches. The crucial difference is that our agent is not tasked to maximize Total (since the evaluation metric does not have the interpretation of "more is better") but to aim for some specific value of the Total.
  • The evaluation function contained in the MDP specifies the expected value of  for all possible transitions: .[1]

First challenge: guaranteeing the fulfillment of expectation-type goals

The challenge in this post is to design a decision algorithm for tasks where the agent's goal is to make the expected (!) Total equal (!) a certain value  which we call the aspiration value[2] This is a crucial difference from a "satisficing" approach that would aim to make expected Total at least as large as  and would thus still be happy to maximize Total. Later we consider other types of tasks, both less restrictive ones (including those related to satisficing) and more specific ones that also care about other aspects of the resulting distribution of Total or states.

It turns out that we can guarantee the fulfillment of this type of goal under some weak conditions!

Notice that a special case of such expectation-type goals is making sure that the probability of reaching a certain set of acceptable terminal states equals a certain value, because we can simply assume that each such acceptable terminal state gives 1 Delta and all others give zero Delta. We will come back to that special case in the next post when discussing aspiration intervals.

Example: Shopping for apples

  • The agent is a certain household's AI butler. Among all kinds of other tasks, roughly once a week it is tasked to go shopping for apples. When it gets this task in the morning, an apple shopping episode begins, which ends when the agent returns to the household to get new instructions or is called by a household member on its smartphone, or when its battery runs empty.
  • The relevant evaluation metric for this task is the household's and agent's joint stock of apples. It changes by some Delta each time the agent gets handed some apples by a merchant or takes some from the supermarket's shelf or when some apples fall from its clumsy hands or get stolen by some robot hater (or when some member of the household eats an apple).
  • As the household's demand for apples is 21 apples per week on average, the task in a single apple shopping episode is to buy a certain number  of apples in expectation. They also want to maintain some small stock for hard times, say about 5 apples. So the household's policy is to set the aspiration to , where  is its current stock of apples. As this is a recurrent task, it is perfectly fine if this aspiration is only fulfilled in expectation if the Total doesn't vary too much, since over many weeks (and assuming long-lived apples), the deviations will probably average out and the stock will vary around 26 apples right after a shopping mission and 5 apples just before the next shopping mission. In reality, the household would of course also want the variance to be small, but that is a performance criterion we will only add later.

Possible generalizations (can be skipped safely)

In later posts, we will generalize the above assumptions in the following ways:

  • Instead of as a single value  that expected Total shall equal, the task can be given as an aspiration interval  into which expected Total shall fall (e.g., "buy about  apples").
  • Instead of a single evaluation metric (stock of apples), there can be  many evaluation metrics (stock of apples, stock of pears, and money in pocket), and the task can be given as a convex aspiration set  (e.g., buy at least two apples and one pear but don't spend more than 1 Euro per item).
  • Instead of in terms of evaluation metrics, the task could be given in terms of the terminal state of the episode, by specifying a particular "desired" state, a set of "acceptable" states, a desired probability distribution of terminal states, or a set of acceptable probability distribution of terminal states. For example, demanding that the expected number of stocked apples after the episode be 26 is the same as saying that all probability distributions of terminal states are acceptable that have the feature that the expected number of stocked apples is 26. A more demanding task would then be to say that only those probability distributions of terminal states are acceptable for which the expected number of stocked apples is 26, its standard deviation is at most 3, its 5 per cent quantile is at least 10, and the 5 per cent quantile of the number of surviving humans is at least 8 billion.
  • The world model might have a more general form than an MDP: to represent different forms of uncertainty, it might be an only partially observed MDP (POMDP), or an ambiguous POMDP (APOMDP); to represent complex tasks, it might be a hierarchical MDP whose top-level actions (e.g., buy 6 apples from this merchant) are represented as lower-level MDPs with lower-level aspirations specified in terms of auxiliary evaluation metrics (e.g., don't spend more than 5 minutes waiting at this merchant), and its lowest level might also have continuous rather than discrete time (if it represents, e.g., the continuous control of a robot's motors).

Notation

We focus on a single episode for a specific task.

Environment. We assume the agent's interaction with the environment consists of an alternating sequence of discrete observations and actions. As usual, we formalize this by assuming that after each observation, the agent chooses an action  and then "calls" a (potentially stochastic) function  provided by the environment that returns the next observation.[3]

World model. The episode's world model, , is a finite, acyclic MDP. The model's discrete time counter, , advances whenever the agent makes an observation. From the sequence of observations made until time , the world model constructs a representation of the state of the world, which we call the model state or simply the state and denote by , and which also contains information about the time index  itself.[4] The set of all possible (model) states is a finite set . A subset  of states is considered terminal. If the state is terminal, the episode ends without further action or Delta, which represents the fact that the agent becomes inactive until given a new task. Otherwise, the agent chooses an action . The world model predicts the consequences of each possible action by providing a probability distribution  for the successor state . It also predicts the expected Delta for the task-relevant evaluation metric as a function of the state, action, and successor state: .

The following two graphs depict all this. Entities occurring in the world model are blue, those in the real world red, green is what we want to design, and dotted things are unknown to the agent:

Fig. 1: Planning with a world model, then acting in the real environment.

We hide the process of constructing the next state from the previous state, action, and next observation by simply assuming that the agent can call a version of the function  that is given the current state and action and returns the successor state constructed from the next observation: .

Goal. The goal is given by an aspiration value . The task is to choose actions so that the expected Total

 equals .

Auxiliary notation for interval arithmetic. We will use the following abbreviations:

  •  
    (interpolation between  and ),
  •  
    (relative position of  in interval , with the convention that ),
  •  
    ("clipping''  to interval ).

Sequential decisions based on propagated aspirations

Main idea

 Our agent will achieve the goal by

  • propagating the aspiration along the trajectory as we go from states  via actions  to successor states , leading to an alternating sequence of state aspirations  and action aspirations .
  • sequentially deciding on the next action  on the basis of the current state aspiration  and suitably chosen action-aspirations  for all possible actions .

For both aspiration propagation and decision making, the agent uses some auxiliary quantities that it computes upfront at the beginning of the episode from the world model as follows.

Feasibility intervals

Similar to what is done in optimal control theory, the agent computes[5] the - and -functions of the hypothetical policy that would maximize expected Total, here denoted  and , by solving the respective Bellman equations 

with  for terminal states . It also computes the analogous quantities for the hypothetical policy that would minimize expected Total, denoted  and 

with  for terminal states . These define the state's and action's feasibility intervals,

The eventual use of these intervals will be to rescale aspirations from step to step. Before we come to that, however, we can already prove a first easy fact about goals of the type "make sure that expected Total equals a certain value":

Lemma: Trivial guarantee

If the world model predicts state transitions correctly, then there is a decision algorithm that fulfills the goal  if and only if the episode's starting aspiration  is in the initial state's feasibility interval .

Proof. The values  and  are, by definition, the expected Total of the maximizing resp. minimizing policy, and hence it is clear that there cannot be a policy which attains  in expectation if  is larger than  or smaller than .

Conversely, assuming that  lies inside the interval , the following procedure fulfills the goal:

  • We compute the relative position of  inside .
  • With probability , we use the maximizing policy  throughout the episode, and with probability , we use the minimizing policy  throughout the episode.

This fulfills the goal, since the correctness of the model implies that, when using  or , we actually get an expected Total of  resp. 

Of course, using this "either maximize or minimize the evaluation metric" approach would be catastrophic for safety. For example, if we tasked an agent with restoring Earth's climate to a pre-industrial state, using as our evaluation metric the global mean temperature, this decision algorithm might randomize, with carefully chosen probability, between causing an ice age and inducing a runaway greenhouse effect! This is very different from what we want, which is something roughly similar to pre-industrial climate.

Another trivial idea is to randomize in each time step  between the action with the largest  and the one with the smallest , using a fixed probability  resp.  . Since expected Total is a continuous function of  which varies between and , by the Intermediate Value Theorem there exists some value of  for which this algorithm gives the correct expected Total; however, it is unclear how to compute the right  in practice.

If the episode consists of many time steps, this method might not lead to extreme values of the Total, but it would still make the agent take an extreme action in each time step. Intuition also suggests that the agent's behavior would be less predictable and fluctuate more than in the first version, where it consistently maximizes or minimizes after the initial randomization, and that this is undesirable.

So let us study more intelligent ways to guarantee that .

Decision Algorithm 1: steadfast action-aspirations, rescaled state-aspirations

In order to avoid extreme actions, our actual decision algorithm chooses "suitable" intermediate actions which it expects to allow it to fulfill the goal in expectation. When in state , it does so by

  • setting action-aspirations  for each possible action  on the basis of the current state-aspiration  and the action's feasibility interval , trying to keep  close to ,
  • choosing in some arbitrary way an "under-achieving" action  and an "over-achieving" action  w.r.t.  and these computed action-aspirations ,
  • choosing probabilistically between  and  with suitable probabilities,
  • executing the chosen action  and observing the resulting successor state , and
  • propagating the action-aspiration  to the new state-aspiration  by rescaling between the feasibility intervals of  and .

More precisely: First compute (or learn) the functions , and . Then, given state  and a feasible state-aspiration ,

  1. For all available actions , compute action-aspirations .
  2. Pick some actions  with ; these necessarily exist because .
  3. Compute .
  4. With probability , let , otherwise let .
  5. Execute action  in the environment and observe the successor state .
  6. Compute  
    and the successor state's state-aspiration .

If we add the state- and action aspirations as entities to the diagram of Fig. 1, we get this:

Fig. 2: Propagating aspirations while acting in the environment.

Example: Shopping for apples, revisited with math

We return to the apple-shopping scenario mentioned above, which we model by the following simple state-action diagram:

Fig. 3: Toy example – apple-shopping environment to demonstrate Algorithm 1

Our agent starts at home (state ) and wishes to obtain a certain number of apples, which are available at a market (state ). It can either walk to the market (action ), which will certainly succeed, or take public transportation (action ), which gives a 2/3 chance of arriving successfully at the market and a 1/3 chance of not reaching the market before it closes and returning home empty-handed. Of course, the agent can also decide to take the null action (action ) and simply stay home the entire day doing nothing.
Once it reaches the market , the agent can buy either one or two packs of three apples (actions  and , respectively) before returning home at the end of the day (state ).

To apply Algorithm 1, we first compute the - and -functions for the maximizing and minimizing policies. Since there are no possible cycles in this environment, straightforwardly unrolling the recursive definitions from the back ("backward induction") yields:

  
  *
  *
  
  *
  
  *
  

(The asterisks* mark which actions give the  values)

Suppose that the agent is in the initial state  and has the aspiration .

  1.  First, we calculate the action-aspirations: if I were to take a certain action, what would I aspire to? Here, the state-aspiration  lies within the action's feasibility set , so the action-aspiration  is simply set equal to . By contrast, the intervals  and  do not contain the point , and so  is clipped to the nearest admissible value:
Fig. 4: Setting action-aspirations in the toy example by clipping state aspiration to actions' feasibility intervals. 
  1. Next, we choose an over-achieving action  and an under-achieving action .
    Suppose we arbitrarily choose actions  (do nothing) and  (walk to the market).
  2. We calculate the relative position of the aspiration  between  and :  .
  3. We roll a die and choose our next action to be either  with probability  or  with probability .
  4. We take the chosen action, walking to the market or doing nothing depending on the result of the die roll, and observe the consequences of our actions! In this case, there are no surprises, as the transitions are deterministic.
  5. We rescale the action-aspiration to determine our new state-aspiration. If we chose action , we deterministically transitioned to state , and so the feasibility interval  is equal to  and no rescaling is necessary (in other words, the rescaling is simply the identity map): we simply set our new state-aspiration to be . Likewise, if we took action , we end up in state  with state-aspiration .

Suppose now that we started with the same initial aspiration , but instead chose action  as our over-achieving action in step 2. In this case, algorithm execution would go as follows:

  1. Determine action-aspirations as before.
  2. Choose .
  3. Since  is exactly equal to our state-aspiration  and  is not,  is 1!
  4. Hence, our next action is deterministically .
  5. We execute action  and observe whether public transportation is late today (ending up in state ) or not (which brings us to state ).
  6. For the rescaling, we determine the relative position of our action-aspiration in the feasibility interval: .
    If we ended up in state , our new state-aspiration is then ; if we ended up in state , the state-aspiration is .

These examples demonstrate two cases:

  • If neither of the action feasibility intervals  nor  contain the state aspiration , we choose between taking action  and henceforth minimizing, or taking action  and henceforth maximizing, with the probability  that makes the expected total match our original aspiration .
  • If exactly one of the feasibility intervals  or  contains , then we choose the corresponding action with certainty, and propagate the aspiration by rescaling it.

There is one last case, where both feasibility intervals contain ; this is the case, for example, if we choose  in the above environment. Execution then proceeds as follows:

  1. We determine action-aspirations, as shown here:
  1. Suppose now that we choose actions  and  as  and . (If action  is chosen, then we are in the second case shown above.)
  2.  is defined as the relative position of  between  and , but in this case, these three values are equal! We have chosen above (see auxiliary notation) that  in this case, but any other probability would also be acceptable.
  3. We toss a coin to decide between actions  and .
  4. The chosen action is performed, either taking us deterministically to the market  if we chose action  or randomizing between  and  if we chose action .
  5. We propagate the action-aspiration to state-aspiration as before. If we chose action , then we have  and so the new state-aspiration is  If we chose action , then the new state-aspiration is  if we reached the market and  otherwise.

Now that we have an idea of how the decision algorithm works, it is time to prove its correctness.

Theorem: Algorithm 1 fulfills the goal

If the world model predicts state transitions correctly and the episode-aspiration  is in the initial state's feasibility interval, , then decision algorithm 1 fulfills the goal .

Proof.

First, let us observe that algorithm 1 preserves feasibility: if we start from state  with state-aspiration ,  then for all states  and actions  visited, we will have  and .
This statement is easily seen to be true for action-aspirations, as they are required to be feasible by definition in step 1, and correctness for state-aspirations follows from the definition of  in step 6.

Let us now denote by  the expected Total obtained by algorithm 1 starting from state  with state-aspiration , and likewise by  the expected Total obtained by starting at step 5 in algorithm 1 with action-aspiration .

Since the environment is assumed to be acyclic and finite[6], we can straightforwardly prove the following claims by backwards induction:

  1. For any state  and any state-aspiration  belonging to the feasibility interval , we indeed have .
  2. For any state-action pair  and any action-aspiration  belonging to the feasibility interval , the expected future Total  is in fact equal to .

We start with claim 1. The core reason why this is true is that, for non-terminal states , we chose the right  in step 3 of the algorithm:

Claim 1 also serves as the base case for our induction: if  is a terminal state, then  is an interval made up of a single point, and in this case claim 1 is trivially true.

Claim 2 requires that the translation between action-aspirations, chosen before the world's reaction is observed, and subsequent state-aspirations, preserves expected Total. The core reason why this works is the linearity of the rescaling operations in step 6:

This concludes the correctness proof of algorithm 1. 

Notes

  • It might seem counterintuitive that the received Delta (or at least the expected Delta) is never explicitly used in propagating the aspiration. The proof above however shows that it is implicitly used when rescaling from  (which contains ) to  (which does not contain it any longer). 
  • We clip the state aspiration to the actions' feasibility intervals to keep the variability of the resulting realized Total low. If the state aspiration is already in the action's feasibility interval, this does not lead to extreme actions. However, if it is outside an action's feasibility interval, it will be mapped onto one endpoint of that interval, so if that action is actually chosen, the subsequent behavior will coincide with a maximizer's or minimizer's behavior from that point on.  
  • The intervals  and  used for clipping and rescaling can also be defined in various other ways than equation (F) to avoid taking extreme actions even more, e.g., by using two other, more moderate reference policies than the maximizer's and minimizer's policy.
  • Rather than clipping the state aspiration to an action's feasibility interval, one could also rescale it from the state's to the action's feasibility interval, . This would avoid extreme actions even more, but would overly increase the variability of received Total even in very simple environments.[7]
  • Whatever rule the agent uses to set action-aspirations and choose candidate actions, it should be for each action independently, because looking at all possible pairs or even all subsets or lotteries of actions would increase the algorithm's time complexity more than we think is acceptable if we want it to remain tractable. We will get back to the question of algorithm complexity in later posts.  

Outlook

The interesting question now is what criteria the agent should use to pick the two candidate actions  in step 2. We might use this freedom to choose actions in a way that increases safety, e.g., by choosing randomly as a form a "soft optimization" or by incorporating safety criteria like limiting impact. We'll explore these ideas in the next post in this sequence.

 

  1. ^

    When we extend our approach later to incorporate performance and safety criteria, we might also have to assume further functions, such as the expected squared Delta (to be able to estimate the variance of Total), or some transition-internal world trajectory entropy (to be able to estimate total world trajectory entropy), etc.

  2. ^

    Note that this type of goal is also implicitly assumed in the alternative Decision Transformer approach, where a transformer network is asked to predict an action leading to a prompted expected Total. 

  3. ^

    If the agent is a part of a more complex system of collaborating agents (e.g., a hierarchy of subsystems), the "action" might consist in specifying a subtask for another agent, that other agent would be modeled as part of the "environment" here, and the observation returned by  might be what that other agent reports back at the end of its performing that subtask.    

  4. ^

    It is important to note that  might be an incomplete description of the true environment state, which we denote  but rarely refer to here. 

  5. ^

    Note that we're in a model-based planning context, so it directly computes the values recursively using the world model, rather than trying to learn it from acting in the real or simulated environment using some form of reinforcement learning.

  6. ^

    These assumptions may of course be generalized, at the cost of some hassle in verifying that expectations are well-defined, to allow cycles or infinite time horizons, but the general idea of the proof remains the same.

  7. ^

    E.g., assume the agent can buy zero or one apple per day, has two days time, and the aspiration is one apple. On the first day, the state's feasibility interval is , the action of buying zero apples has feasibility interval  and would thus get action-aspiration 0.5, and the action of buying one apple has feasibility interval  and would thus get action-aspiration 1.5. To get 1 on average, the agent would thus toss a coin. So far, this is fine, but on the next day the aspiration would not be 0 or 1 in order to land at 1 apple exactly, but would be 0.5. So on the second day, the agent would again toss a coin. Altogether, it would get 0 or 1 or 2 apples, with probabilities 1/4, 1/2, and 1/4, rather than getting 1 apple for sure.



Discuss
❌
❌