From reinforcement learning to sequential decision analytics

Warren B Powell

“Reinforcement learning” has been undergoing an evolution from the early work in the 1980s-1990s where it focused on a form of approximate dynamic programming called Q-learning, to the study of a diverse range of algorithmic strategies for sequential decision problems.  Since this time it has continued to evolve, along with a number of other communities that also work on sequential decision problems.  I feel I have anticipated where those entire evolution is likely to end up.

Please enter any comments on the ideas in this page at https://tinyurl.com/RLSOcomments.

My new book, Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions (RLSO) is the first book to propose a universal framework that models any sequential decision problem, from pure learning (multiarmed bandit) to complex resource allocation problems.  We formally pose the problem of optimizing over policies (methods for making decisions), and present four classes of policies that include any method for making decisions.  The entire book is organized around this universal modeling framework and the four classes of policies.

This modeling framework spans 15 distinct communities that work on sequential decision problems. These communities use eight distinct notational systems, and feature a wide range of solution techniques tailored to specific problem domains that arise in different fields.  However, there are four fields that make particularly important contributions: Markov decision processes, math programming, optimal (stochastic) control, and stochastic search (derivative-based and derivative-free).  Of course, we will draw heavily on the tools of machine learning and Monte Carlo simulation.

Most of these communities have been evolving from an initial algorithmic strategy to different strategies that span the four classes of policies.  This evolution has been taking place in a number of communities that work on sequential decision problems - the best illustration is the second edition of Sutton and Barto’s Reinforcement Learning book that now includes policy gradient methods, upper confidence bounding and Monte Carlo tree search.  In the graphic to the right, I have listed seven of the most prominent communities: stochastic search (derivative-free and derivative-based), simulation optimization, multiarmed bandits, optimal control, Markov decision processes, reinforcement learning and stochastic programming.  I then show which of the four classes of policies are actively studied in each community (wider lines represent the policies that were initially studied).

Below I list the differences between how these problems are described and solved in the reinforcement learning literature, and the modeling and solution approach that I use in RLSO.  I claim that my approach better represents how people in the RL community actually write their software, and the algorithms that are actually being used (including algorithms that have not yet been adopted!). Below I touch on three issues:

Modeling

\[\max_{\pi} \mathbb{E} \left\{\sum_{t=0}^T C(S_t,X^\pi(S_t))|S_0 \right\}\]

Designing policies

\[V(s) = \max_x \left[C(S,x)+\sum_{s'}P(s'|s,x)V(s')\right]\]

This is almost never computable, so people then begin creating algorithms for approximating the value functions, which is often viewed as some form of magic.  I spent 20 years working on this strategy, and while I enjoyed some real breakthroughs, I came to realize that this approach only works on a very narrow set of applications.  

We optimize over all four classes of policies (these are meta-classes), which covers any approach that you might use.  The four classes of policies can be divided into two broad strategies: policy search and lookahead approximations:

\[X^\pi(S_t|\theta)  = argmax_{x_t}\left(C(S_t,x_t) + \mathbb{E}\left\{{\bar V}_{t+1}(S_{t+1}|\theta)|S_t\right\}\right)\]

The expectation inside the max operator makes it difficult (usually impossible) to handle problems where x is a vector, but we can use a post-decision state \(S^x_t\) which is the state immediately after we make a decision (the RL community uses the unusual term “after state” variable, but it is important to recognize that it is the state after a decision is made).   This produces the policy:

\[X^\pi(S_t|\theta)  = argmax_{x_t}\left(C(S_t,x_t) + {\bar V}^x_t(S^x_t|\theta)\right)\]

Now we have a deterministic optimization problem, which opens the door to using solvers such as Cplex and Gurobi, making it possible to handle high-dimensional decision vectors such as arise in resource allocation problems.

VFA policies are known in the RL community as Q-learning, where \(Q(s,a)\) (using a for action instead of x for decision) is the value of being in state s and taking action a.  The policy would be written

\[A^\pi(S_t)  = argmax_{a}{\bar Q}(S_t,a)\]

DLA policies are widely known under the name “model predictive control.”  

Policy search/parameter tuning

\[F(\theta) = \mathbb{E}\left\{\sum_{t=0}^T C(S_t,X^\pi(S_t|\theta))|S_0\right\}\]
where $$S_{t+1} = S^M(S_t,X^\pi(S_t \theta),W_{t+1})\(. A simple example of\)X^\pi(S_t \theta)$$ would be a UCB policy popular in reinforcement learning for multiarmed bandit problems, given by
\[X^\pi(S_t|\theta) = \argmax_x \big({\bar \mu}_{tx} + \theta {\bar \sigma}_{tx}\big)\]

where \(x\) is one of a set of discrete choices (drugs, products),  \({\bar \mu}_{tx}\) is our current estimate of how well \(x\) will perform after t observations, and \(\theta {\bar \sigma}_{tx}\) is the estimate of the standard deviation of \({\bar \mu}_{tx}\).

Since we can never actually compute the expectation, we need a way for generating samples \(W^n_1, \ldots, W^n_T\) for a sample \(n=1, \ldots, N\).  We then approximate \(F(\theta)\) using \({\bar F}(\theta)\) using

\[{\bar F}(\theta) = \frac{1}{N}\sum_{n=1}^N\sum_{t=0}^T C(S^n_t,X^\pi(S^n_t|\theta))\]

where \(S^n_t\) is the state generated by following sample path \(W^n_1, \ldots, W^n_T\).

We now face the problem of determining the best value of \(\theta\).  We have been surprised at how often this is overlooked in the RL community, such as tuning \(\theta\) in our UCB policy above. 

Tuning is its own sequential decision problem, which can be tackled using either derivative-based methods (see Chapter 5 in RLSO) or derivative-free methods (Chapter 7).  This means that tuning a policy for multiarmed bandit problems requires solving another multiarmed bandit problem. We can model the tuning problem as a Markov decision problem, and solve it using any of the four classes of policies!  (See Chapter 7 for an in-depth discussion of this issue).

-