An estimation based allocation rule with super-linear regret and finite lock-on time for time-dependent multi-armed bandit processes

IEEE Canadian Conference on Electrical and Computer Engineering

Abstract

The multi-armed bandit (MAB) problem has been an active area of research since the early 1930s. The majority of the literature restricts attention to i.i.d. or Markov reward processes. In this paper, the finite-parameter MAB problem with time-dependent reward processes is investigated. An upper confidence bound (UCB) based index policy is proposed, in which the index is computed from the maximum-likelihood estimate of the unknown parameter. This policy locks on to the optimal arm in finite expected time but incurs super-linear regret. As an example, the proposed index policy is used to minimize prediction error when each arm is an auto-regressive moving average (ARMA) process.
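To make the policy description concrete, the Python sketch below illustrates a generic UCB-style index rule in which each arm's index is a plug-in estimate of its mean reward (a sample mean standing in for the paper's maximum-likelihood estimate) plus an exploration bonus, applied to toy AR(1) prediction tasks standing in for the ARMA example. The specific index form, the function names, and the AR(1) setup are illustrative assumptions, not the paper's exact construction.

import numpy as np

def ucb_mle_index_policy(arms, horizon, confidence=2.0):
    """Generic UCB-style index policy sketch (not the paper's exact index).

    `arms` is a list of callables; arms[k]() returns the next reward of arm k.
    Each arm's index is a plug-in estimate of its mean reward (here a sample
    mean, standing in for a maximum-likelihood estimate of the unknown
    parameter) plus an exploration bonus that shrinks as the arm is sampled.
    """
    n_arms = len(arms)
    pulls = np.zeros(n_arms)        # number of times each arm was played
    reward_sum = np.zeros(n_arms)   # cumulative reward of each arm
    history = []

    for t in range(1, horizon + 1):
        if t <= n_arms:
            k = t - 1                        # play each arm once to initialise
        else:
            estimate = reward_sum / pulls    # plug-in (MLE-style) estimate
            bonus = np.sqrt(confidence * np.log(t) / pulls)
            k = int(np.argmax(estimate + bonus))
        r = arms[k]()
        pulls[k] += 1
        reward_sum[k] += r
        history.append((k, r))
    return history

# Illustrative arms: rewards are negative one-step squared prediction errors
# of simple AR(1) processes (a stand-in for the ARMA setting in the paper).
def make_ar1_arm(phi, sigma, rng):
    state = {"x": 0.0}
    def pull():
        pred = phi * state["x"]                      # one-step predictor
        nxt = phi * state["x"] + sigma * rng.normal()
        state["x"] = nxt
        return -(nxt - pred) ** 2                    # reward = -squared error
    return pull

rng = np.random.default_rng(0)
arms = [make_ar1_arm(0.5, 1.0, rng), make_ar1_arm(0.5, 0.3, rng)]
plays = ucb_mle_index_policy(arms, horizon=500)

The sample-mean index is used here only because the abstract does not specify the likelihood model; for a genuine ARMA arm, the plug-in estimate would instead come from fitting that arm's ARMA parameters to its observed history.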
