Reinforcement learning versus sequential decision analytics

Reinforcement learning is by now a substantial community in computer science that studies sequential decision problems, but with a style and accent that is unique to that community. This page offers a brief history of reinforcement learning and contrasts the styles of Sutton and Barto’s classic Reinforcement Learning: An Introduction (2nd edition, 2018) against the style of Warren Powell’s Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decision problems.

The discussion is divided into the following topics (this is a developing webpage):

Topic 1: The evolution of the field of reinforcement learning - From Sutton and Barto (1994), to Sutton and Barto (2018), to... the universal framework?

A horizontal timeline graphic with arrows tracing the evolution of the reinforcement-learning field from Sutton and Barto's 1998 first edition through their 2018 second edition, suggesting a possible convergence toward a universal framework The field of reinforcement learning is undergoing rapid evolution, as have the other fields that deal with sequential decision problems. In 1998, the first edition of Sutton and Barto’s Reinforcement Learning: An introduction appeared, focusing almost entirely on algorithms for approximating value functions.

In 2018, Sutton and Barto published the second edition of their book. This time, it covered a variety of algorithms, raising the question: just what is “reinforcement learning”? Is it a method? or a problem?

In the 2018 edition of Reinforcement Learning: An introduction, S&B note [p. 2] “Reinforcement learning… is simultaneously a problem, a class of solution methods that work well on the problem, and the field that studies this problem and its solution methods.” (Click here for a more complete discussion of the question “What is RL.”)

The field of reinforcement learning is undergoing the same evolution as most of the other major fields that work on sequential decision problems. As the community grows, it addresses a wider range of problems. Inevitably, no single method works on all problems, so people create new methods. What is unusual about the RL community is that, as they introduce new methods, they keep calling everything “reinforcement learning.”

My approach: First recognize that the common thread among all of these problems is that they are some form of sequential decision problem. Second, I found that every single method introduced for making decisions falls in one of the four classes of policies, or a hybrid. I discuss the four classes of policies, and hybrids, in depth in chapter 11 of RLSO (this can be downloaded from the RLSO webpage here).

Although my book (RLSO) covers a wide range of methods, it clearly identifies the problem class as sequential decision problems which can be modeled using the universal framework. It then introduces the four classes of policies (plus hybrids) which includes not only every method for making decisions that has been introduced in the literature (or used in practice), it also includes methods that haven’t even been invented yet (these are meta-classes). See Topic 3 (below) for a brief discussion of the four classes of policies.

For this reason, I view my book as the natural endpoint for reinforcement learning, as well as all the other fields that deal with decisions under uncertainty (I represent these with “stochastic optimization”).

Topic 2: Differences between Sutton and Barto’s Reinforcement Learning: An introduction (RL) and my Reinforcement Learning and Stochastic Optimization: A universal framework for sequential decisions (RLSO)

Two book covers side by side: Sutton and Barto's Reinforcement Learning: An Introduction (RL, 2018 second edition) on the left and Warren B. Powell's Reinforcement Learning and Stochastic Optimization (RLSO, 2022) on the right It is time to clarify the differences between “reinforcement learning” (as represented by Sutton and Barto’s Reinforcement Learning: An introduction) and the emerging field of sequential decision analytics, as captured by my book *Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions. *Below, I use “RL” to refer to Sutton and Barto’s “Reinforcement Learning” and “RLSO” to refer to my “Reinforcement Learning and Stochastic Optimization.” For more on RLSO, click here.

 

 

For more on the general field of sequential decision analytics, click here for a web page of links to different resources.

Topic 3: From Q-learning to the four classes of policies

The field of reinforcement learning followed my own early path: I started with a particular type of resource allocation problem (managing fleets of vehicles) where Bellman’s equation seemed to be a natural strategy. Deciding to send a vehicle from region i to region j required understanding the value of being in region j. The challenge with my setting was that I did not have a single vehicle: I needed to manage fleets of hundreds, even thousands, of vehicles. The state space exploded, which meant that I needed to approximate the downstream value.

As my range of applications expanded, the techniques that I found myself using also grew. This paralleled the path followed in reinforcement learning, as well as other communities such as stochastic search, optimal control, and simulation optimization.

I came to realize that all of these fields were steadily discovering methods for making decisions that fell into four classes:

  1. Policy function approximations (PFAs) - These are analytical functions that determine a decision given an analytical function of the information in the state variable. Examples include: order-up-to policies for inventory problems, buy-low, sell-high policies in finance, linear decision rules, Boltzman policies.
  2. Cost functions approximations (CFAs) - These are parameterized versions of simplified optimization problems. Examples include upper confidence bounding, parameterized linear programs, modified shortest path problems.
  3. Value function approximations (VFAs) - This covers the entire class of policies that depend on an approximation of a value function (often equated with “reinforcement learning”).
  4. Direct lookahead approximations (DLAs) - Here we optimize over some typically approximate version of the problem over some horizon to determine what to do now. I like to divide this class into two subclasses: a. Deterministic lookaheads (which might be parameterized) - This is often what people mean when they say “model predictive control.” b. Stochastic lookaheads - In chapter 19, section 19.7, I provide a variety of strategies for approximating stochastic lookaheads, focusing on the problem of designing policies within the lookahead model.

All of these policies are represented in Sutton and Barto’s Reinforcement Learning:

The graphic below illustrates the adoption of each of the eight most important fields of the four classes of policies (wider lines means earlier adoption).

Evolution of policies for different fields
Evolution of policies for different fields

I am constantly being asked: how to choose which class of policy to use? The graphic below lists five types of policies (one each from the first four classes, and then I split the four class into two types: deterministic lookahead and stochastic lookahead).

A 4-row table grouping the four classes of policies into popularity categories. Category 1 (most popular): PFAs (rules/analytical functions), CFAs (parameterized deterministic optimization), and deterministic DLAs — note: 'By far the most widely used in practice; the choice among the three tends to be obvious from the structure of the problem.' Category 2 — stochastic lookahead with discrete actions (decision trees, value-of-information policies): 'Popular in the decision analysis and Bayesian optimization communities.' Category 3 — stochastic lookahead using Bellman (policies based on VFAs / approximate dynamic programming): 'A very powerful strategy for a very small number of specialized problems.' Category 4 — stochastic lookahead with vectors (two-stage stochastic programming): 'This is how deterministic optimizers handle uncertainty. Almost no-one uses it in practice.'

The academic literature has been heavily biased first toward policies based on approximating Bellman’s equation, and second toward policies based on some form of stochastic lookahead (think of Monte Carlo tree search or stochastic programming). However, in practice I have found that the five types of policies can be organized into three categories, sorted from the ones that are most widely used to the policies that are least used:

Category 1: The most popular policies consist of:

Category 2: Stochastic direct lookaheads - Sometimes we need a lookahead, but we need to recognize that the future is uncertain.

Category 3: Policies based on value function approximations (VFAs) - After decades of focusing on approximate dynamic programming (and writing a major book on the topic), I am now convinced that policies based on approximate dynamic programming are the most difficult to use, and will be the least-used policy in practice.

Within category 1, the choice of which of the three policies to use will typically be obvious from the application. In the universe of decision problems that we encounter across the vast range of human activities, I am now convinced that policies in category 1 will be used the vast majority of the time (hint: this is what is happening now). However, often overlooked is that the policies in category 1 depend on tunable parameters, and as I say in RLSO: tuning is hard! See my discussion of parameter tuning in chapter 12 of RLSO (section 12.4 discusses the most popular strategies).

Note that in my opinion, based on many years of research focused specifically on policies based on approximating Bellman’s equation, I became convinced that while this is a very powerful strategy, it is useful for just a narrow set of applications. As an indication: in the universe of problems that require making decisions, real applications of Bellman’s equation are quite rare (one good example of the use of approximate dynamic programming is the software at my company, Optimal Dynamics, that uses VFAs to optimize the assignment of drivers over time).

See chapter 11, section 11.10, for my discussion of how to choose among the four classes of policies (this is downloadable from RLSO).

Topic 5: Choosing notation - Notation matters, and here I explain my particular choices.

I spent decades working out my notational system. in section 2.1 of RLSO, I present 15 different fields using the notation most popular in that field. My framework is closest to that used in optimal control, but with different notation. Specifically: