Understanding the Multi-Armed Bandit problem
The most important feature distinguishing reinforcement learning from other types of learning is that it uses information that evaluates the actions taken rather than instructs by giving correct actions. Because such evaluative feedback provides no direct guidance on other choices, an agent must explore in order to discover better behavior. This contrasts with supervised learning, where instructive feedback identifies correct actions independently of the agent's decisions.
Evaluative feedback depends entirely on the action taken. To study this aspect in isolation, it helps to work in a simplified setting that does not involve learning to act in more than one situation. The formulation of this nonassociative, evaluative-feedback problem is the multi-armed bandit problem (or k-armed bandit problem).
A k-armed Bandit Problem
The most common analogy for the k-armed bandit problem is a slot machine, or "one-armed bandit", except that it has k levers instead of one. Each action selection is like a play of one of the slot machine's levers, and the rewards are the payoffs for hitting the jackpot. Through repeated selections, you try to maximize your winnings (reward) by concentrating your actions on the best levers.
Here, each of the k actions has an expected or mean reward given that the action is selected, known as the value of the action. We denote the true value of action a as q_*(a) = \mathbb{E}[R_t \mid A_t = a].
If we knew the value of each action, then it would be trivial to solve the k-armed bandit problem: we would always select the action with the highest value. So we assume that we do not know the action values with certainty.
If we maintain estimates of the action values, then at any time step there will be at least one action whose estimated value is greatest. Let’s call these the greedy actions. When we select such actions, we can say we are exploiting our current knowledge of action values.
If we instead select one of the non-greedy actions, we can say we are exploring, because this enables us to improve our estimates of the non-greedy actions' values. Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run.
Reward is lower in the short run, during exploration, but higher in the long run, because after we have discovered the better actions we can exploit them many times. In any specific case, whether it is better to explore or exploit depends in a complex way on the precise values of the estimates, the uncertainties, and the number of remaining steps.
Action Value Methods
Methods for estimating the values of actions and for using those estimates to make action-selection decisions are collectively called action-value methods. The true value of an action is the mean reward received when that action is selected, and one natural way to estimate it is by averaging the rewards actually received:
Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}
where \mathbb{1}_{A_i = a} denotes the random variable that is 1 if the predicate A_i = a is true and 0 if it is not. If the denominator is zero, we instead define Q_t(a) to be some default value such as 0. As the denominator goes to infinity, the estimate converges to the true value q_*(a) by the law of large numbers. We call this the sample-average method for estimating action values because each estimate is an average of the sample of relevant rewards.
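Below is a minimal sketch of the sample-average method in Python. The testbed itself (the number of arms, the hidden true values q_star, and the Gaussian reward noise) is an assumption made purely for illustration, and actions are chosen uniformly at random here just so that every arm gets sampled.

```python
import numpy as np

# Toy testbed: k arms with hidden true values and Gaussian reward noise
# (all of these specifics are assumptions made purely for illustration).
rng = np.random.default_rng(0)
k = 5
q_star = rng.normal(0.0, 1.0, k)         # true action values, unknown to the agent

rewards_seen = [[] for _ in range(k)]    # every reward observed for each action

def estimate(a):
    """Sample-average estimate Q_t(a); defaults to 0 if a was never selected."""
    return float(np.mean(rewards_seen[a])) if rewards_seen[a] else 0.0

for t in range(2000):
    a = rng.integers(k)                  # sample arms uniformly, just to gather data
    r = rng.normal(q_star[a], 1.0)       # noisy reward around the true value
    rewards_seen[a].append(r)

print([round(float(estimate(a) - q_star[a]), 2) for a in range(k)])  # errors shrink with data
```

Storing every reward like this works, but it is exactly the inefficiency addressed by the incremental computation discussed further below.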
Greedy and ε-Greedy Action Selection
The simplest action selection rule is to select one of the actions with the highest estimated value, that is, one of the greedy actions as defined above. If there is more than one greedy action, then a selection is made among them in some arbitrary way. We write this greedy action selection method as:
A_t = \operatorname{argmax}_a Q_t(a)
where \operatorname{argmax}_a denotes the action a for which the expression that follows is maximized. Greedy action selection always exploits current knowledge to maximize immediate reward; it spends no time at all sampling apparently inferior actions to see if they might really be better.
Therefore, a simple alternative is to behave greedily most of the time, but every once in a while, say with small probability ε, instead select randomly from among all the actions with equal probability, independently of the action-value estimates. We call methods using this near-greedy action selection rule ε-greedy methods.
The main advantage is that as the number of steps increases, every action will be sampled an infinite number of times, ensuring that all the estimates converge to the true action values in the stationary case. This in turn implies that the probability of selecting the optimal action converges to greater than 1 − ε, i.e., to near certainty.
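Here is a short sketch of ε-greedy selection on a toy stationary testbed. The helper name epsilon_greedy, the arm count, and the reward model are assumptions for illustration; the value estimates are updated with the incremental form of the sample average, which is derived in the next section.

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With probability epsilon pick a random arm; otherwise pick a greedy arm,
    breaking ties between equally valued arms at random."""
    if rng.random() < epsilon:
        return rng.integers(len(Q))
    return rng.choice(np.flatnonzero(Q == Q.max()))

# Toy stationary testbed (all numbers below are illustrative assumptions).
rng = np.random.default_rng(1)
k = 10
q_star = rng.normal(0.0, 1.0, k)           # hidden true action values
Q = np.zeros(k)                            # estimated action values
N = np.zeros(k)                            # selection counts

for t in range(5000):
    a = epsilon_greedy(Q, epsilon=0.1, rng=rng)
    r = rng.normal(q_star[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]              # incremental sample average (derived below)

print("truly best arm:", int(q_star.argmax()), "| most selected arm:", int(N.argmax()))
```

Setting epsilon to 0 recovers pure greedy selection, which is an easy way to compare the two behaviors on the same testbed.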
Incremental Computation of Action Values
The action-value methods discussed above estimate action values as sample averages of observed rewards. However, computing these averages naively is not efficient.
The obvious implementation would be to maintain a record of all the rewards and then perform this computation whenever the estimated value was needed. However, if this is done, then the memory and computational requirements would grow over time as more rewards are seen. Each additional reward would require additional memory to store it and additional computation to compute the sum in the numerator.
To simplify notation, we concentrate on a single action. Let R_i denote the reward received after the i-th selection of this action, and let Q_n denote the estimate of its value after it has been selected n − 1 times.
With this notation, the update rule for computing the estimate incrementally at each time step is as follows:
Q_{n+1} = Q_n + \frac{1}{n} \left[ R_n - Q_n \right]
Here, n denotes the number of times the action has been selected. This implementation requires memory only for Q_n and n, and only a small, constant computation per new reward.
More generally, the update has the form:
NewEstimate \leftarrow OldEstimate + StepSize \left[ Target - OldEstimate \right]
The expression [ Target − OldEstimate ] is an error in the estimate. It is reduced by taking a step towards the "Target".
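As a quick sanity check, the sketch below applies the general update rule with StepSize = 1/n to a short hypothetical reward sequence; the running estimate ends up equal to the plain average, as expected, without storing any past rewards.

```python
def update(old_estimate, target, step_size):
    """NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# Incremental sample average for a single action: after the n-th reward
# the step size is 1/n, so no past rewards need to be stored.
Q, n = 0.0, 0
for reward in [1.0, 0.0, 2.0, 1.0]:   # hypothetical reward sequence
    n += 1
    Q = update(Q, reward, 1.0 / n)
print(Q)   # 1.0 -- exactly the plain average of the four rewards
```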
Tracking a non-stationary problem
The methods discussed so far are good for problems in which reward probabilities do not change over time. However, many real-world problems violate this assumption and we often encounter reinforcement learning problems that are non-stationary in nature.
For such cases it makes sense to give more weight to recent rewards than to those from long ago. One of the most popular ways of doing this is to use a constant step-size parameter α ∈ (0, 1]:
Q_{n+1} = Q_n + \alpha \left[ R_n - Q_n \right]
This update gives more weight to recent rewards and gradually forgets older observations.
Unrolling this update gives:
Q_{n+1} = (1 - \alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} R_i
We call this a weighted average because the sum of the weights, (1 - \alpha)^n + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i}, is equal to 1. The weight given to a reward decays exponentially with how long ago it was received, so this is sometimes called an exponential recency-weighted average.
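The following sketch contrasts the sample average with a constant step size on a single arm whose true value drifts as a small random walk (the drift model and α = 0.1 are assumptions made only to create non-stationarity). The constant-α estimate tracks the drifting value, while the sample average lags because it weighs very old rewards equally.

```python
import numpy as np

# Single drifting arm: its true value q follows a small random walk
# (the drift model and alpha = 0.1 are illustrative assumptions).
rng = np.random.default_rng(2)
alpha = 0.1
q = 0.0                                 # true (non-stationary) value
Q_avg, Q_const, n = 0.0, 0.0, 0

for t in range(10000):
    q += rng.normal(0.0, 0.01)          # the true value drifts
    r = rng.normal(q, 1.0)              # noisy reward around the current true value
    n += 1
    Q_avg += (r - Q_avg) / n            # sample average: all rewards weighted equally
    Q_const += alpha * (r - Q_const)    # exponential recency-weighted average

print(f"true value {q:.2f} | constant-alpha {Q_const:.2f} | sample average {Q_avg:.2f}")
```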
Upper-Confidence Bound (UCB) Action Selection
Returning to the exploration-exploitation tradeoff discussed earlier: because of uncertainty in the accuracy of the action-value estimates, the greedy actions may look best at present, but some of the other actions may actually be better in the long run.
The ε-greedy method forces the non-greedy actions to be tried, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain. One effective way to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates, is to select actions according to:
A_t = \operatorname{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]
This is called Upper Confidence Bound (UCB) action selection. Here N_t(a) is the number of times action a has been selected prior to time t, and c > 0 controls the degree of exploration. The square-root term is a measure of the uncertainty, arising from limited sampling, in the estimate of a's value. The quantity being maxed over is thus a sort of upper bound on the possible true value of action a.
The use of the natural logarithm means that the increases get smaller over time, but are unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
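A compact sketch of UCB action selection follows. The function name ucb_action, the choice c = 2, and the testbed are assumptions for illustration; arms that have never been tried are treated as maximizing and are therefore selected first, which handles the case N_t(a) = 0.

```python
import numpy as np

def ucb_action(Q, N, t, c):
    """argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; arms never tried yet
    are treated as maximizing and therefore selected first."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

# Illustrative stationary testbed (arm count, c = 2, and reward noise are assumptions).
rng = np.random.default_rng(3)
k = 10
q_star = rng.normal(0.0, 1.0, k)
Q, N = np.zeros(k), np.zeros(k)

for t in range(1, 2001):
    a = ucb_action(Q, N, t, c=2.0)
    r = rng.normal(q_star[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]            # sample-average value estimates

print("truly best arm:", int(q_star.argmax()), "| most selected arm:", int(N.argmax()))
```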
Associative Search (Contextual Bandits)
In many settings, the agent encounters different bandit problems under different circumstances. If the agent receives contextual information indicating which problem it is currently facing, it can learn a policy that maps contexts to actions.
These problems are known as Associative Search tasks or Contextual Bandits. They extend the k-armed bandit framework by conditioning action selection on observable context, while still retaining the bandit property that actions affect only immediate rewards.
Associative search tasks are intermediate between the k-armed bandit problem and the full reinforcement learning problem. They are like the full reinforcement learning problem in that they involve learning a policy, but like our version of the k-armed bandit problem in that each action affects only the immediate reward. If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem. Contextual bandits are therefore a natural extension of classical bandits.
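As a rough sketch of what associative search can look like in code, the example below keeps one independent table of value estimates per context and runs ε-greedy within each. The number of contexts, the arm count, ε, and the reward model are assumptions for illustration only.

```python
import numpy as np

# One independent value table per context; epsilon-greedy within each context.
# The number of contexts, arm count, epsilon, and reward model are illustrative assumptions.
rng = np.random.default_rng(4)
n_contexts, k, epsilon = 3, 4, 0.1
q_star = rng.normal(0.0, 1.0, (n_contexts, k))   # true value of each (context, action) pair
Q = np.zeros((n_contexts, k))
N = np.zeros((n_contexts, k))

for t in range(20000):
    s = rng.integers(n_contexts)                 # observe which bandit problem we face
    if rng.random() < epsilon:
        a = rng.integers(k)
    else:
        a = rng.choice(np.flatnonzero(Q[s] == Q[s].max()))
    r = rng.normal(q_star[s, a], 1.0)
    N[s, a] += 1
    Q[s, a] += (r - Q[s, a]) / N[s, a]

# Did we learn the best action for each context?
print(Q.argmax(axis=1) == q_star.argmax(axis=1))
```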
Thank you for reading!!
The content of the blog is referenced from Reinforcement Learning: An Introduction (2nd Edition) by Richard S. Sutton and Andrew G. Barto. If you are curious to learn more about Reinforcement Learning, I recommend this book to satisfy your curiosity.
Also, if I have missed anything or if you have any thoughts to share, feel free to contact me:)