We consider a restless multi-armed bandit in which each arm can be in one of two states. When an arm is sampled, the state of the arm is not available to the sampler. Instead, a binary signal with a known randomness that depends on the state of the arm is available. No signal is available if the arm is not sampled. An arm-dependent reward is accrued from each sampling. In each time step, each arm changes state according to known transition probabilities, which in turn depend on whether the arm is sampled or not. Since the state of the arm is never visible and has to be inferred from the current belief and a possible binary signal, we call this the hidden Markov bandit. Our interest is in a policy for selecting the arm(s) in each time step that maximizes the infinite-horizon discounted reward. Specifically, we seek to use Whittle's index to select the arms. We first analyze the single-armed bandit and show that, in general, it admits an approximate threshold-type optimal policy when there is a positive reward for the `no-sample' action. We also identify several special cases for which the threshold policy is indeed optimal. Next, we show that such a single-armed bandit also satisfies an approximate-indexability property. For the case when the single-armed bandit admits a threshold-type optimal policy, we compute the Whittle index for each arm. Numerical examples illustrate the analytical results.
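For concreteness, the belief update underlying the hidden Markov bandit can be sketched as follows; the notation here is illustrative and not necessarily that of the paper. Let $\pi_t$ denote the belief that a given arm is in state $1$, let $P = (p_{ij})$ be the known transition matrix under the chosen action, and let $\rho_i(z)$ be the probability of observing signal $z \in \{0,1\}$ when the arm is in state $i$. If the arm is sampled and signal $z$ is observed, a Bayes update followed by a one-step transition gives
\[
\pi_{t+1} \;=\; \frac{\pi_t\,\rho_1(z)\,p_{11} \;+\; (1-\pi_t)\,\rho_0(z)\,p_{01}}{\pi_t\,\rho_1(z) \;+\; (1-\pi_t)\,\rho_0(z)},
\]
whereas if the arm is not sampled, no signal is received and the belief simply propagates through the corresponding (no-sample) transition matrix, $\pi_{t+1} = \pi_t\,p_{11} + (1-\pi_t)\,p_{01}$. A threshold-type policy of the kind analyzed in the paper then samples the arm whenever this belief crosses a threshold.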