Conference on Neural Information Processing Systems

Regret Bounds for Thompson Sampling in Episodic Restless Bandit Problems


Abstract

Restless bandit problems are instances of non-stationary multi-armed bandits. These problems have been well studied from the optimization perspective, where the goal is to efficiently find a near-optimal policy when the system parameters are known. However, very few papers adopt a learning perspective, in which the parameters are unknown. In this paper, we analyze the performance of Thompson sampling in episodic restless bandits with unknown parameters. We consider a general policy map to define our competitor and prove an O(√T) Bayesian regret bound. Our competitor is flexible enough to represent various benchmarks, including the best fixed-action policy, the optimal policy, the Whittle index policy, or the myopic policy. We also present empirical results that support our theoretical findings.
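
To make the algorithmic pattern concrete, below is a minimal Python sketch of episodic Thompson sampling for restless bandits. It is not the construction analyzed in the paper: it assumes fully observed two-state arms with independent Beta priors on every transition probability, and uses the myopic policy (computed under the sampled model) as the competitor. The toy instance and all names (true_p, p_hat, and so on) are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical toy instance: N two-state arms (0 = bad, 1 = good).
    # Every arm transitions at every step whether or not it is pulled
    # (the "restless" property); pulling an arm in the good state pays 1.
    N, EPISODES, HORIZON = 4, 200, 20
    true_p = rng.uniform(0.2, 0.8, size=(N, 2))  # P(next good | arm i, state s)

    # Beta(alpha, beta) posterior on each transition probability.
    alpha = np.ones((N, 2))
    beta = np.ones((N, 2))
    arms = np.arange(N)

    total_reward = 0
    for ep in range(EPISODES):
        # Thompson sampling: draw one model from the posterior per episode.
        p_hat = rng.beta(alpha, beta)

        states = np.zeros(N, dtype=int)  # each episode resets the system
        for t in range(HORIZON):
            # Myopic policy under the sampled model: pull the arm most
            # likely to be in the good state next step.
            arm = int(np.argmax(p_hat[arms, states]))

            # Restless dynamics: all arms transition simultaneously.
            next_states = (rng.random(N) < true_p[arms, states]).astype(int)
            total_reward += next_states[arm]

            # Simplification for this sketch: all transitions are observed,
            # so every (state -> next state) pair updates its posterior.
            alpha[arms, states] += next_states
            beta[arms, states] += 1 - next_states
            states = next_states

    print("total reward:", total_reward)
    print("posterior means:\n", alpha / (alpha + beta))
    print("true parameters:\n", true_p)

The episodic structure is the key point: one model is drawn from the posterior at the start of each episode and a policy is computed for that sampled model, mirroring the posterior-sampling scheme whose Bayesian regret the paper bounds.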