On the Whittle Index for Restless Multi-armed Hidden Markov Bandits


Abstract

We consider a restless multi-armed bandit in which each arm can be in one of two states. When an arm is sampled, the state of the arm is not available to the sampler. Instead, a binary signal with a known randomness that depends on the state of the arm is available. No signal is available if the arm is not sampled. An arm-dependent reward is accrued from each sampling. In each time step, each arm changes state according to known transition probabilities which in turn depend on whether the arm is sampled or not sampled. Since the state of the arm is never visible and has to be inferred from the current belief and a possible binary signal, we call this the hidden Markov bandit. Our interest is in a policy to select the arm(s) in each time step that maximizes the infinite horizon discounted reward. Specifically, we seek the use of Whittle's index in selecting the arms. We first analyze the single-armed bandit and show that in general, it admits an approximate threshold-type optimal policy when there is a positive reward for the 'no-sample' action. We also identify several special cases for which the threshold policy is indeed the optimal policy. Next, we show that such a single-armed bandit also satisfies an approximate-indexability property. For the case when the single-armed bandit admits a threshold-type optimal policy, we perform the calculation of the Whittle index for each arm. Numerical examples illustrate the analytical results.
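The belief tracking described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function name `belief_update`, the parameter names, the numeric values, and the order of events (observe the signal first, then let the arm transition) are all assumptions made for illustration. The belief is taken to be the probability that the arm is in state 1.

```python
# Hypothetical example parameters (not from the paper).
RHO = (0.2, 0.9)                      # RHO[s] = P(signal == 1 | state == s)
P_SAMPLE = [[0.7, 0.3], [0.4, 0.6]]   # transition matrix when the arm is sampled
P_REST = [[0.9, 0.1], [0.2, 0.8]]     # transition matrix when the arm is not sampled


def belief_update(b, sampled, signal, rho=RHO, p_sample=P_SAMPLE, p_rest=P_REST):
    """Return the next belief P(state == 1), given the current belief b.

    Assumed order of events: if the arm is sampled, a binary signal is
    observed and the belief is corrected by Bayes' rule; the arm then
    moves according to the transition matrix of the chosen action.
    """
    if sampled:
        # Likelihood of the observed signal under each hidden state.
        like1 = rho[1] if signal == 1 else 1.0 - rho[1]
        like0 = rho[0] if signal == 1 else 1.0 - rho[0]
        denom = b * like1 + (1.0 - b) * like0
        posterior = b * like1 / denom if denom > 0 else b
        trans = p_sample
    else:
        # No signal is available, so the belief is only propagated.
        posterior = b
        trans = p_rest
    # One-step prediction through the action-dependent Markov chain.
    return (1.0 - posterior) * trans[0][1] + posterior * trans[1][1]
```

Under these assumptions, a threshold-type policy of the kind the abstract mentions would compare the value returned by `belief_update` against a fixed threshold to decide whether to sample the arm in the next step.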


Bibliographic record

Authors: Meshram, Rahul; Manjunath, D.; Gopalan, Aditya
Year: 2017
Original format: PDF

Similar literature


1. Scheduling Periodic Real-Time Traffic in Lossy Wireless Networks as Restless Multi-Armed Bandit [J]. Jun Xu, Chengcheng Guo. Wireless Communications Letters, IEEE. 2019, No. 4.
2. An Asymptotically Optimal Heuristic for General Nonstationary Finite-Horizon Restless Multi-Armed, Multi-Action Bandits [J]. Zayas-Caban Gabriel, Jasin Stefanus, Wang Guihua. Advances in Applied Probability. 2019, No. 3.
3. Interactive Restless Multi-armed Bandit Game and Swarm Intelligence Effect [J]. Yoshida Shunsuke, Hisakado Masato, Mori Shintaro. New Generation Computing. 2016, No. 3.
4. A Hidden Markov Restless Multi-armed Bandit Model for Playout Recommendation Systems [C]. Rahul Meshram, Aditya Gopalan, D. Manjunath. International Conference on Communication Systems and Networks. 2017.
5. Learning in A Changing World: Restless Multi-Armed Bandit with Unknown Dynamics [D]. Liu, Haoyang. 2013.
6. Smoking and the bandit: A preliminary study of smoker and non-smoker differences in exploratory behavior measured with a multi-armed bandit task [O]. Merideth A. Addicott, John M. Pearson, Jessica Wilson.
7. A Hidden Markov Restless Multi-armed Bandit Model for Playout Recommendation Systems [O]. Meshram, Rahul; Gopalan, Aditya; Manjunath, D. 2017.
8. Learning in A Changing World: Non-Bayesian Restless Multi-Armed Bandit [R]. Liu, H.; Liu, K.; Zhao, Q. 2010.
