Conference: International Conference on Signal Processing and Communications

On a Class of Restless Multi-armed Bandits with Deterministic Policies

Abstract

We describe and analyze a restless multi-armed bandit (RMAB) in which, in each time-slot, the instantaneous reward from playing an arm depends on the time since the arm was last played. This model is motivated by recommendation systems where the payoff from a recommendation depends on the recommendation history. For an RMAB with N arms, and known reward functions for each arm that have a finite support (akin to a maximum memory) of M steps, we characterize the optimal policy that maximizes the infinite-horizon time-average of the reward. Specifically, using a weighted-graph representation of the system evolution, we show that a periodic policy is optimal. Further, we show that the optimal periodic policy can be obtained using an algorithm with polynomial time and space complexity. Some extensions to the basic model are also presented; several more are possible. RMABs with such large state spaces for the arms have not been previously considered.
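
The weighted-graph view in the abstract can be illustrated with a small brute-force sketch. This is not the paper's polynomial-time algorithm, which the abstract does not spell out. The sketch assumes that exactly one arm is played per slot, that an arm's "age" (slots since it was last played) is capped at M, and that rewards[i][a - 1] is the reward for playing arm i at age a. The function names build_graph and max_cycle_mean are hypothetical. Each joint age vector is a node, each play is a weighted edge, and the best periodic policy corresponds to a maximum-mean cycle, computed here with Karp's algorithm. The graph has M^N nodes, so this is only practical for tiny instances.

```python
from fractions import Fraction
from itertools import product


def build_graph(rewards, M):
    """Map each age vector to its outgoing edges [(next_state, reward), ...].

    Assumption: rewards[i][a - 1] is the reward for playing arm i at age a,
    with ages in 1..M (age 1 means the arm was played in the previous slot).
    """
    N = len(rewards)
    graph = {}
    for state in product(range(1, M + 1), repeat=N):
        edges = []
        for arm in range(N):
            # The played arm resets to age 1; every other arm ages by one slot, capped at M.
            nxt = tuple(1 if j == arm else min(a + 1, M) for j, a in enumerate(state))
            edges.append((nxt, rewards[arm][state[arm] - 1]))
        graph[state] = edges
    return graph


def max_cycle_mean(graph):
    """Karp's algorithm: the largest average edge weight over all cycles."""
    nodes = list(graph)
    n = len(nodes)
    # F[k][v] is the best total weight of a walk with exactly k edges ending at v,
    # or None if no such walk exists. Walks may start at any node.
    F = [{v: 0 for v in nodes}]
    for _ in range(n):
        prev, cur = F[-1], {v: None for v in nodes}
        for u, edges in graph.items():
            if prev[u] is None:
                continue
            for v, w in edges:
                cand = prev[u] + w
                if cur[v] is None or cand > cur[v]:
                    cur[v] = cand
        F.append(cur)
    best = None
    for v in nodes:
        if F[n][v] is None:
            continue
        worst = min(
            Fraction(F[n][v] - F[k][v]) / (n - k)
            for k in range(n)
            if F[k][v] is not None
        )
        if best is None or worst > best:
            best = worst
    return best


# Two arms with memory M = 2: arm 0 pays 3 and arm 1 pays 1 when rested for a slot,
# and both pay 0 when replayed immediately. Alternating the arms is optimal.
print(max_cycle_mean(build_graph([[0, 3], [0, 1]], M=2)))  # prints 2
```

In this toy instance the alternating schedule 0, 1, 0, 1, ... earns (3 + 1) / 2 = 2 per slot, matching the computed cycle mean. Exact Fraction arithmetic is used so that the result can be compared without floating-point tolerance.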
