Journal: IEEE Transactions on Information Theory

Learning in a Changing World: Restless Multiarmed Bandit With Unknown Dynamics

Abstract

We consider the restless multiarmed bandit problem with unknown dynamics in which a player chooses one out of $N$ arms to play at each time. The reward state of each arm transits according to an unknown Markovian rule when it is played and evolves according to an arbitrary unknown random process when it is passive. The performance of an arm selection policy is measured by regret, defined as the reward loss with respect to the case where the player knows which arm is the most rewarding and always plays the best arm. We construct a policy with an interleaving exploration and exploitation epoch structure that achieves a regret with logarithmic order. We further extend the problem to a decentralized setting where multiple distributed players share the arms without information exchange. Under both an exogenous restless model and an endogenous restless model, we show that a decentralized extension of the proposed policy preserves the logarithmic regret order as in the centralized setting. The results apply to adaptive learning in various dynamic systems and communication networks, as well as financial investment.
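As a rough illustration of the interleaved exploration and exploitation epoch structure described above, here is a minimal Python sketch. It is not the paper's policy. It assumes hypothetical two-state Markov arms whose state is the reward, lets every arm evolve at every step (one simple instance of a restless model), doubles the epoch lengths, and starts an exploration epoch whenever the per-arm exploration count falls below `d_const * log(t + 2)`. The names `TwoStateArm`, `run_epoch_policy`, and `d_const` are made up for this example, and regret is approximated from the arms' stationary means.

```python
import math
import random


class TwoStateArm:
    """Hypothetical restless arm: a two-state Markov chain whose state is its reward."""

    def __init__(self, p01, p10, rng):
        self.p01, self.p10, self.rng = p01, p10, rng
        self.state = 0

    def step(self):
        # Assumption: the arm evolves at every time step, played or not.
        flip_prob = self.p01 if self.state == 0 else self.p10
        if self.rng.random() < flip_prob:
            self.state = 1 - self.state

    def stationary_mean(self):
        # Long-run fraction of time spent in state 1.
        return self.p01 / (self.p01 + self.p10)


def run_epoch_policy(params, horizon, d_const=10.0, seed=0):
    """Interleave exploration and exploitation epochs; return (plays, regret)."""
    rng = random.Random(seed)
    arms = [TwoStateArm(p01, p10, rng) for p01, p10 in params]
    n = len(arms)
    totals = [0.0] * n  # reward observed on each arm during exploration
    counts = [0] * n    # exploration plays of each arm
    plays = [0] * n     # all plays of each arm
    t = 0
    explore_len, exploit_len = 1, 1

    def play(i, explore):
        nonlocal t
        for arm in arms:
            arm.step()
        plays[i] += 1
        if explore:
            totals[i] += arms[i].state
            counts[i] += 1
        t += 1

    while t < horizon:
        if min(counts) < d_const * math.log(t + 2):
            # Exploration epoch: play every arm for one block.
            for i in range(n):
                for _ in range(explore_len):
                    if t >= horizon:
                        break
                    play(i, explore=True)
            explore_len *= 2
        else:
            # Exploitation epoch: play the arm with the best sample mean.
            best = max(range(n), key=lambda i: totals[i] / counts[i])
            for _ in range(exploit_len):
                if t >= horizon:
                    break
                play(best, explore=False)
            exploit_len *= 2

    best_mean = max(arm.stationary_mean() for arm in arms)
    regret = sum(plays[i] * (best_mean - arms[i].stationary_mean()) for i in range(n))
    return plays, regret
```

Calling `run_epoch_policy([(0.05, 0.45), (0.25, 0.25), (0.45, 0.05)], horizon=20000)` returns the per-arm play counts and the approximate regret. Because exploration is only triggered when the logarithmic threshold overtakes the sample count, the time spent on suboptimal arms in this sketch grows slowly with the horizon, which is the intuition behind a logarithmic regret order.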