Theoretical Computer Science

Regret bounds for restless Markov bandits



Abstract

We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that first represents the setting as an MDP which exhibits some special structural properties. In order to grasp this information we introduce the notion of ε-structured MDPs, which are a generalization of concepts like (approximate) state aggregation and MDP homomorphisms. We propose a general algorithm for learning ε-structured MDPs and show regret bounds that demonstrate that additional structural information enhances learning. Applied to the restless bandit setting, this algorithm achieves, after any T steps, regret of order O(√T) with respect to the best policy that knows the distributions of all arms. We make no assumptions on the Markov chains underlying each arm except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.
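To make the setting concrete, here is a minimal sketch (not from the paper; the class name, transition matrices, and reward values are invented for illustration) of a restless bandit environment in Python: every arm's chain advances at every step, whether or not it is pulled, and the learner observes only the state and reward of the arm it chose.

```python
import random

class RestlessMarkovBandit:
    """Minimal sketch of a restless Markov bandit environment.

    Each arm is an irreducible Markov chain (irreducibility is the
    paper's only assumption on the chains). All chains transition at
    every step, regardless of which arm the learner pulls -- this is
    the 'restless' property.
    """

    def __init__(self, chains, rewards, seed=0):
        self.chains = chains    # chains[i][s] = transition probs of arm i from state s
        self.rewards = rewards  # rewards[i][s] = reward of arm i in state s
        self.states = [0] * len(chains)  # current (unobserved) state of every arm
        self.rng = random.Random(seed)

    def step(self, arm):
        """Pull `arm`: observe its current state and reward, then advance ALL chains."""
        obs = self.states[arm]
        reward = self.rewards[arm][obs]
        for i, chain in enumerate(self.chains):
            probs = chain[self.states[i]]
            self.states[i] = self.rng.choices(range(len(probs)), weights=probs)[0]
        return obs, reward

# Two arms, each a 2-state irreducible chain: one mixes slowly, one quickly.
bandit = RestlessMarkovBandit(
    chains=[[[0.9, 0.1], [0.1, 0.9]],   # slowly mixing arm
            [[0.5, 0.5], [0.5, 0.5]]],  # fast mixing arm
    rewards=[[0.0, 1.0], [0.2, 0.8]],
)
print(sum(r for _, r in (bandit.step(arm=0) for _ in range(100))))
```

A learner interacting with such an environment sees only what step() returns for the pulled arm, so it must infer the unobserved arms' dynamics from these partial observations; representing this inference problem as an MDP with special structure is the role the ε-structured MDP formulation plays in the paper.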
