1. Multi-armed Bandits
The main difference between reinforcement learning and other learning methods is that it uses training information that evaluates the actions taken rather than instructing by giving the correct actions. This is what creates the need for active exploration, for an explicit search for good behavior.
Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible. Purely instructive feedback, on the other hand, indicates the correct action to take, independently of the action actually taken. The latter is the basis of supervised learning, which includes large parts of pattern classification, artificial neural networks, and system identification.
In their pure forms, these two kinds of feedback are quite distinct: evaluative feedback depends entirely on the action taken, while instructive feedback is independent of the action taken.
k-armed bandit problem: Consider a problem in which you repeatedly face a choice among k different options, or actions. Each choice yields a numerical reward drawn from a stationary probability distribution that depends on the action selected. The goal is to maximize the expected total reward over some time period, say T time steps.
In the k-armed bandit problem, each of the k actions has an expected (mean) reward given that the action is selected; call this the value of that action. Notice that if we knew the value of each action, the problem would be trivial: we would simply select the action with the highest value every time. In practice, however, we do not know the action values with certainty, although we may have estimates. Now let us introduce some notation for this problem:
$A_t$ := the action selected on time step $t$
$R_t$ := the corresponding reward received on time step $t$ after taking action $A_t$
$q_*(a) := \mathbb{E}[R_t \mid A_t = a]$ := the expected reward given that action $a$ is selected (the value of action $a$)
$Q_t(a)$ := the estimated value of action $a$ at time step $t$
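To make the setup concrete, here is a minimal Python sketch of a stationary k-armed bandit. The class and method names (KArmedBandit, step) and the choice of normal distributions for the true values and rewards are assumptions of mine (borrowed from the usual 10-armed testbed convention), not definitions from the text.

```python
import numpy as np

# Minimal sketch (assumed names): a stationary k-armed bandit.
# Each action a has a fixed true value q_*(a); pulling an arm returns a
# noisy reward centred on that value, so q_*(a) = E[R_t | A_t = a].
class KArmedBandit:
    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        # True action values q_*(a), drawn once and then held fixed (stationary).
        self.q_star = self.rng.normal(loc=0.0, scale=1.0, size=k)

    def step(self, action):
        # Reward R_t sampled around the true value of the chosen action A_t.
        return self.rng.normal(loc=self.q_star[action], scale=1.0)

bandit = KArmedBandit(k=10)
reward = bandit.step(action=3)   # R_t for A_t = 3
```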
If we have estimates of the action values, then at every time step there is at least one action whose estimated value is greatest; these are called the greedy actions. When we select one of them, we are said to be exploiting our current knowledge of the values of the actions. If instead we select one of the non-greedy actions, we are said to be exploring, because this allows us to improve our estimate of that action's value. Exploiting is the right thing to do to maximize the expected reward on a single step, but exploring may produce greater total reward in the long run, even though the immediate reward is lower. Because it is not possible both to explore and to exploit with a single action selection, one often speaks of the conflict between exploration and exploitation.
In this chapter several simple methods for balancing exploration and exploitation in the k-armed bandit problem will be presented, to show that they work much better than methods that always exploit. The need to balance exploration and exploitation is one of the central challenges of reinforcement learning.
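One of the simplest such balancing methods is ε-greedy action selection: exploit the current estimates most of the time, but with small probability ε pick an action at random. Since the method is only alluded to at this point, the sketch below, including the parameter ε and the function name, should be read as an illustrative assumption rather than this chapter's formal definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, epsilon=0.1):
    """Pick an action from estimated values Q: explore with probability
    epsilon, otherwise exploit a greedy action (ties broken randomly)."""
    if rng.random() < epsilon:
        return rng.integers(len(Q))            # explore: uniformly random action
    greedy = np.flatnonzero(Q == np.max(Q))    # all greedy actions
    return rng.choice(greedy)                  # exploit one of them
```

With ε = 0 this reduces to pure exploitation; a larger ε explores more often at the cost of lower immediate reward.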
1.1. Action-value Methods
Methods for estimating the values of actions and for using those estimates to make action-selection decisions are collectively called action-value methods. One natural way to estimate the true value of an action is to average the rewards actually received when that action was selected:
$$Q_t(a) := \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \,\mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}},$$
where $\mathbb{1}_{A_i=a}$ is 1 if $A_i = a$ and 0 otherwise. If the denominator is zero (the action has not yet been tried), $Q_t(a)$ is instead set to some default value, such as 0.
The simplest action selection rule is to select one of the greedy actions, that is, an action with the highest estimated value, breaking ties randomly if there is more than one. This greedy action selection method can be written as
$$A_t := \operatorname*{arg\,max}_a Q_t(a),$$
where $\arg\max_a$ denotes the action $a$ at which the expression is maximized.
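Putting the pieces together, here is a minimal sketch of the sample-average estimate combined with greedy action selection (ties broken randomly). It reuses the hypothetical KArmedBandit class from the earlier sketch; the function and variable names are mine, not the text's.

```python
import numpy as np

# Minimal sketch (assumed names): sample-average estimation with greedy selection.
def run_greedy(bandit, steps=1000, seed=1):
    rng = np.random.default_rng(seed)
    k = len(bandit.q_star)
    reward_sum = np.zeros(k)   # sum of rewards when a was taken prior to t
    count = np.zeros(k)        # number of times a was taken prior to t
    Q = np.zeros(k)            # Q_t(a); 0 is the default for untried actions
    total = 0.0
    for t in range(steps):
        # Greedy selection: an action with the highest estimate, ties broken randomly.
        greedy = np.flatnonzero(Q == Q.max())
        a = rng.choice(greedy)
        r = bandit.step(a)
        # Update the sample average for the chosen action.
        reward_sum[a] += r
        count[a] += 1
        Q[a] = reward_sum[a] / count[a]
        total += r
    return Q, total

bandit = KArmedBandit(k=10)
Q_est, total_reward = run_greedy(bandit, steps=1000)
```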
1.3. The 10-armed Testbed
2. Finite Markov Decision Processes
3. Dynamic Programming
4. Monte Carlo Methods
5. Temporal-Difference Learning
6. n-step Bootstrapping
7. Planning and Learning with Tabular Methods