
On the Whittle Index for Restless Multi-armed Hidden Markov Bandits



Abstract

We consider a restless multi-armed bandit in which each arm can be in one of two states. When an arm is sampled, the state of the arm is not available to the sampler. Instead, a binary signal with a known randomness that depends on the state of the arm is available. No signal is available if the arm is not sampled. An arm-dependent reward is accrued from each sampling. In each time step, each arm changes state according to known transition probabilities which in turn depend on whether the arm is sampled or not sampled. Since the state of the arm is never visible and has to be inferred from the current belief and a possible binary signal, we call this the hidden Markov bandit. Our interest is in a policy to select the arm(s) in each time step that maximizes the infinite horizon discounted reward. Specifically, we seek the use of Whittle's index in selecting the arms. We first analyze the single-armed bandit and show that in general, it admits an approximate threshold-type optimal policy when there is a positive reward for the 'no-sample' action. We also identify several special cases for which the threshold policy is indeed the optimal policy. Next, we show that such a single-armed bandit also satisfies an approximate-indexability property. For the case when the single-armed bandit admits a threshold-type optimal policy, we perform the calculation of the Whittle index for each arm. Numerical examples illustrate the analytical results.
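
The abstract says the arm's state must be inferred from the current belief and a possible binary signal, but gives no update formula. The sketch below is one standard way to write such a belief update, not necessarily the paper's formulation. It assumes the belief `b` is the probability that the arm is in state 1, that the signal is observed before the state transition, and that `rho`, `P_sample` and `P_rest` (all hypothetical names) hold the known signal and transition probabilities.

```python
def bayes_posterior(b, signal, rho):
    """Posterior probability of state 1 after observing a binary signal.

    b      : prior probability that the arm is in state 1
    signal : observed binary signal (0 or 1)
    rho    : rho[s] = P(signal = 1 | state = s), assumed known
    """
    like1 = rho[1] if signal == 1 else 1.0 - rho[1]
    like0 = rho[0] if signal == 1 else 1.0 - rho[0]
    numerator = b * like1
    return numerator / (numerator + (1.0 - b) * like0)


def next_belief(b, sampled, signal, rho, P_sample, P_rest):
    """One-step belief update for a single two-state arm.

    P_sample[s][t] and P_rest[s][t] are the probabilities of moving from
    state s to state t when the arm is sampled or not sampled.
    Assumption: the signal refers to the state before the transition.
    """
    if sampled:
        # A signal is available only when the arm is sampled.
        b = bayes_posterior(b, signal, rho)
        P = P_sample
    else:
        # No signal: the belief is only propagated through the chain.
        P = P_rest
    return b * P[1][1] + (1.0 - b) * P[0][1]


# Illustrative numbers only; they are not taken from the paper.
rho = (0.2, 0.8)
P_sample = [[0.9, 0.1], [0.3, 0.7]]
P_rest = [[0.8, 0.2], [0.4, 0.6]]
b_after_sample = next_belief(0.5, True, 1, rho, P_sample, P_rest)
b_after_rest = next_belief(0.5, False, None, rho, P_sample, P_rest)
```

Under these assumptions, a threshold-type policy of the kind the abstract mentions would compare the scalar `b` with a fixed cutoff to decide whether to sample the arm.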
