I get many questions about reinforcement learning -- how to think
about it and how do it successfully in practice. This FAQ is an
attempt to pull together in one place some of the more common
questions and answers. I have been free with my opinions, but I
would also welcome the opinions of others for inclusion here. In you
have anything to add to my comments, including whole new questions
or totally different answers, please send me a note at
Reinforcement learning (RL) is learning from interaction with an environment, from the consequences of action, rather than from explicit teaching. RL become popular in the 1990s within machine learning and artificial intelligence, but also within operations research and with offshoots in psychology and neuroscience.
Most RL research is conducted within the mathematical framework of Markov decision processes (MDPs). MDPs involve a decision-making agent interacting with its environment so as to maximize the cumulative reward it receives over time. The agent perceives aspects of the environment's state and selects actions. The agent may estimate a value function and use it to construct better and better decision-making policies over time.
RL algorithms are methods for solving this kind of problem, that is, problems involving sequences of decisions in which each decision affects what opportunities are available later, in which the effects need not be deterministic, and in which there are long-term goals. RL methods are intended to address the kind of learning and decision making problems that people and animals face in their normal, everyday lives.
For more information, see the sources recommended for an introduction to RL.
Modern reinforcement learning concerns both trial-and-error learning without a model of the environment, and deliberative planning with a model. By "a model" here we mean a model of the dynamics of the environment. In the simplest case, this means just an estimate of the state-transition probabilities and expected immediate rewards of the environment. In general it means any predictions about the environment's future behavior conditional on the agent's behavior.
To a first approximation, Reinforcement Learning and Neuro-Dynamic Programming are synonomous. The name "reinforcement learning" came from psychology (although psychologists rarely use exactly this term) and dates back to the eary days of cybernetics. For example, Marvin Minsky used this term in his 1954 thesis, and Barto and Sutton revived it in the early 1980's. The name "neuro-dynamic programming" was coined by Bertsekas and Tsitsiklis in 1996 to capture the idea of the field as a combination of neural networks and dynamic programming.
In fact, neither name is very descriptive of the subject, and I recommend you use neither when you want to be technically precise. Names such as this are useful when referring to a general body of research, but not for carefully distinguishing ideas from one another. In that sense, there is no point in trying to draw a careful distinction between the referents of these two names.
The problem with "reinforcement learning" is that it is dated. Much of the field does not concern learning at all, but just planning from complete knowledge of the problem (a model of the environment). Almost the same methods are used for planning and learning, and it seems off-point to emphasize learning in the name of the field. "Reinforcement" does not seem particularly pertinent either.
The problem with "neuro-dynamic programming" is similar in that neither neural networks nor dynamic programming are critical to the field. Many of the methods, such as "Monte Carlo", or "rollout" methods, are completely unrelated to dynamic programming, and neural networks are just one choice among many for method of function approximation. Moreover, one could argue that the component names, "neural networks" and "dynamic programming", are each not very descriptive of their respective methods.
Using function approximation, RL can apply to much larger state spaces than classical sequential optimization techniques such as dynamic programming. In addition, using simulations (sampling), RL can apply to systems that are too large or complicated to explicitly enumerate the next-state transition probabilities.
Ideally, the ideas of reinforcement learning could constitute part of a computational theory of what the brain is doing and why. A number of links have been drawn between reinforcement learning and neuroscience, beginning with early models of classical conditioning based on temporal-difference learning (see Barto and Sutton, 1982; Sutton and Barto, 1981, 1990; Moore et al., 1986), and continuing through work on foraging and prediction learning (see Montague et al., 1995, 1996), and on dopamine neurons in monkeys as a temporal-difference-error distribution system. A good survey paper is available. See also Suri, submitted. A book collects a number of relevant papers. Doya has extensively developed RL models of the basal ganglia. Many of these areas are very active at present and changing rapidly.
Broadly speaking, RL works as a pretty good model of instrumental learning, though a detailed argument for this has never been publically made (the closest to this is probably Barto, Sutton and Watkins, 1990). On the other hand, the links between classical (or Pavlovian) conditioning and temporal-difference (TD) learning (one of the central elements of RL) are close and widely acknowledged (see Sutton and Barto, 1990).
Ron Sun has developed hybrid models combining high-level and low-level skill learning, based in part on RL, which make contact with psychological data (see Sun, Merrill, and Peterson, 2001).
Formally, RL is unrelated to behaviorism, or at least to the aspects of behaviorism that are widely viewed as undesireable. Behaviorism has been disparaged for focusing exclusively on behavior, refusing to consider what was going on inside the head of the subject. RL of course is all about the algorithms and processes going on inside the agent. For example, we often consider the construction of internal models of the environment within the agent, which is far outside the scope of behaviorism.
Nevertheless, RL shares with behaviorism its origins in animal learning theory, and in its focus on the interface with the environment. RL's states and actions are essentially animal learning theory's stimuli and responses. Part of RL's point is that these are the essential common final path for all that goes on in the agent's head. In the end it all comes down to the actions taken and the states perceived.
Most work with genetic algorithms simulates evolution, not learning during an individual's life, and because of this is very different from work in RL. That having been said, there are two provisos. First, there is a large body of work on classifier systems that uses or is closely related to genetic algorithms. This work is concerned with learning during a single agent's lifetime (using GAs to organize the components of the agent's mind) and is in fact RL research. The second proviso is that GA work is often related to RL by virtue of being applied to the same problems. For example, GA methods can be applied to evolve a backgammon player and that player can be compared with a player learned by RL methods. In fact, a large portion of systems evolved by GAs are controllers that could alternatively be learned by RL methods. It is tempting here to make a blanket statement about which class of methods is more appropriate or performs better. A crucial distinction is that between problems in which state information is available and those in which it is not. In backgammon, for example, the state of the board is available. In problems like these RL methods seem to have a distinct advantage over evolutionary methods such as GAs. On other problems there is little state information. Suppose you wish to learn a golf swing, or some other ballistic motion that is generally over before useful sensory feedback is available. Then the system is far from Markov and there is little or no advantage provided by addressing the state information during a single swing. In these cases RL methods would not be expected to have an advantage over evolutionary methods. In fact, if they insisted on trying to treat sensory information as state, as in using temporal-difference methods, they may well do worse.
There is plenty of blame here to go around. Minsky, Bellman, Howard, Andreae, Werbos, Barto, Sutton, Watkins... All played important or early roles. See the history from the Sutton and Barto text.
For a general introduction, I recommend my book with Prof. Barto:For a more formal treatment, including rigorous proofs, I recommend the text by Bertsekas and Tsitsiklis: If you don't have time for a textbook-length treatment, your best bet is one or both of these two papers:
There is also a lot of other material there: code, figures, errata, some teaching materials.
Readers using the book for self study can obtain answers on a chapter-by-chapter basis after working on the exercises themselves. Send email to firstname.lastname@example.org with your efforts to answer the exercises for a chapter, and I will send back a postscript file with the answers for that chapter.
Yes, but you can't get by with simple tables; you will need some kind of function approximation.
"Function approximation" refers to the use of a parameterized functional form to represent the value function (and/or the policy), as opposed to a simple table. A table is able to represent the value of each state separately, without confusion, interaction, or generalization with the value of any other state. In typical problems, however, there are far too many states to learn or represent their values individually; instead we have to generalize from observed to states to new, unobserved ones. In principle, this need not be a problem. There are a host of supervised learning methods that can used to approximate functions. However, there are both theoretical and practical pitfalls, and some care is needed. See Chapter 8 of the Sutton and Barto text.
For the most part, the theoretical foundation that RL adopts from dynamic programming is no longer valid in the case of function approximation. For example, Q-learning with linear function approximation is known to be unsound. The strongest positive result is for on-policy prediction with linear function approximators ( Tsitsiklis and Van Roy, 1997; Tadic, 2001). This is an area of active current research (e.g., see Gordon, 2001; Precup, Sutton & Dasgupta, 2001).
It is true that most RL work has considered discrete action spaces, but this was usually done for convenience, not as an essential limitation of the ideas; and there are exceptions. Nevertheless, it is often not obvious how to extend RL methods to continuous, or even large discrete, action spaces. The key problem is that RL methods typically involve a max or sum over elements of the action space, which is not feasible if the space is large or infinite. The natural approach is to replace the enumeration of actions with a sample of them, and average (just as we replace the enumeration of possible next states with a sample of the, and average). This requires either a very special structure for the action-value function, or else a stored representation of the best known policy. Actor-critic methods are one approach.
With no attempt to be exhaustive, some of the earlier RL research with continuous actions includes:
I would be glad to include new or other work in this list as well. Please send me pointers!
Although the ideas of RL extend naturally, in principle, to continuous time, there has been relatively little work here. The best work on this so far is probably that by Kenji Doya.
The curse of dimensionality refers to the tendency of a state space to grow exponentially in its dimension, that is, in the number of state variables. This is of course a problem for methods such as dynamic programming and other table-based RL methods whose complexity scales linearly with the number of states. Many RL methods are able to partially escape the curse by sampling and by function approximation.
Curiously, the key step in the tile-coding approach to function approximation is expanding the original state representation into a vastly higher dimensional space. This makes many complex nonlinear relationships in the original representation simpler and linear in the expanded representation. The more dimensions in the expanded representation, the more functions can be learned easily and quickly. I call this the blessing of dimensionality. It is one of the ideas behind modern support vector machines, but in fact it goes back at least as far as the perceptron.
Basically, no. Tile coding is a quite general idea and many of the ways it can be used avoid the curse of dimensionality. There are at least three common tricks: 1) you can consider subsets of the state variables in separate tilings, 2) you can use overlapping tilings to get vastly higher resolution in high dimensional spaces than would be possible with a simple grid using the same amount of memory, and 3) you can hash down to a smaller space so as only to devote memory to the portions of state space that actually occur.
The idea of tile coding is essentially the same as that underlying CMACs. Without in any way intending to detract from the credit due the pioneers of CMACs (e.g, Albus, Marr, Miller...), sometimes it is better to switch to a new name. The name CMAC stands for, among other things, "Cerebellar Model Articulatory Controller," which seems pretty inappropriate for the current usage. The original CMAC also used a slightly different algorithm -- a correlation rule rather than an error correction rule. The use in reinforcement learning steps it even farther away from its original use. What remains is not so much a learning system (much less a cerebellar model), but a way of coding states for use by a learning system. The key features of the coding is that it is based on multiple exhaustive partitions (tilings!) of the state space, and that it is particularly suited for implementation on serial, digital computers.
It is a common error to use a backpropagation neural network as the function approximator in one's first experiments with reinforcement learning, which almost always leads to an unsatisfying failure. The primary reason for the failure is that backpropation is fairly tricky to use effectively, doubly so in an online application like reinforcement learning. It is true that Tesauro used this approach in his strikingly successful backgammon application, but note that at the time of his work with TDgammon, Tesauro was already an expert in applying backprop networks to backgammon. He had already built the world's best computer player of backgammon using backprop networks. He had already learned all the tricks and tweaks and parameter settings to make backprop networks learn well. Unless you have a similarly extensive background of experience, you are likely to be very frustrated using a backprop network as your function approximator in reinforcement learning.
The solution is to step back and accept you can only innovate in one area at a time. First make sure you understand RL ideas in the tabular case, and the general principles of function approximation in RL. Then proceed to a better understood function approximator, such as a linear one. In my own work I never go beyond linear function approximators. I just augment them with larger and larger feature vectors, using coarse tile-coding (see e.g., Chapter 8 of the book, Sutton, 1996; Stone and Sutton, 2001). It may be that this approach would always be superior to backpropagation nets, but that remains to be seen. But someone new to learning and RL algorithms certainly should not start with backprop nets.
The true answer, of course, is that we don't know, and that it probably hasn't been invented yet. Each algorithm has strengths and weaknesses, and the current favorite changes every few years. In the 1980s actor-critic methods were very popular, but in the 1990s they were largely superceded by value-function methods such as Q-learning and Sarsa. Q-learning is probably still the most widely used, but its instability with function approximation, discovered in 1995, probably rules it out for the long run. Recently policy-based methods such as actor-critic and value-function-less methods, including some of those from the 1980s, have become popular again.
So, it seems we must keep our minds and options open as RL moves forward.
This code is made available on an as-is basis, and in many cases I can provide little or no support. In many cases, I have no personal experience with the code, and in others it will not even run outside of my special computer environments. I make the code available nevertheless, because reading it may clarify some of the implementation points for you. In a few cases, like the tile coding implementations in C++, I have written the code myself to run universally, and I can provide some support.
Maybe you could contribute some of your own code, perhaps a class project, that is better documented and works on a wider range of computational platforms.