JMLR: Workshop and Conference Proceedings

Randomized Exploration for Non-Stationary Stochastic Linear Bandits


Abstract

We investigate two perturbation approaches to overcome the conservatism that optimism-based algorithms chronically suffer from in practice. The first approach replaces optimism with simple randomization when using confidence sets. The second adds random perturbations to the current estimate before maximizing the expected reward. For non-stationary linear bandits, where each action is associated with a $d$-dimensional feature and the unknown parameter is time-varying with total variation $B_T$, we propose two randomized algorithms, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted Linear Thompson Sampling (D-LinTS), via these two perturbation approaches. We highlight the statistical-optimality versus computational-efficiency trade-off between them: the former asymptotically achieves the optimal dynamic regret $\tilde{O}(d^{2/3} B_T^{1/3} T^{2/3})$, while the latter is oracle-efficient at the cost of an extra logarithmic factor in the number of arms relative to the minimax-optimal dynamic regret. In a simulation study, both algorithms show outstanding performance in tackling the conservatism issue that Discounted LinUCB struggles with.
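To make the two perturbation approaches concrete, below is a minimal sketch of one round of each, assuming a discounted ridge-regression estimate with a Gaussian perturbation scaled by the inverse discounted Gram matrix. The paper's exact perturbation distributions, discounting scheme, and tuning constants may differ; the names gamma, lam, and beta and the specific update rule are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5          # feature dimension
gamma = 0.99   # discount factor handling non-stationarity (hypothetical value)
lam = 1.0      # ridge regularization
beta = 1.0     # exploration scale (hypothetical tuning knob)

V = lam * np.eye(d)   # discounted Gram matrix
b = np.zeros(d)       # discounted reward-weighted feature sum

def update(x, y):
    """Discount past data, then incorporate the new observation (x, y)."""
    global V, b
    V = gamma * V + np.outer(x, x) + (1 - gamma) * lam * np.eye(d)
    b = gamma * b + y * x

def d_randlinucb_arm(arms):
    """First approach (D-RandLinUCB style): replace optimism with a random
    confidence-width multiplier Z, shared across arms within the round,
    instead of always taking the upper confidence bound."""
    theta_hat = np.linalg.solve(V, b)
    V_inv = np.linalg.inv(V)
    Z = rng.normal()  # random scalar in place of a fixed optimistic width
    def index(x):
        width = np.sqrt(x @ V_inv @ x)
        return x @ theta_hat + Z * beta * width
    return max(arms, key=index)

def d_lints_arm(arms):
    """Second approach (D-LinTS style): perturb the current estimate,
    then greedily maximize the estimated expected reward."""
    theta_hat = np.linalg.solve(V, b)
    cov = beta**2 * np.linalg.inv(V)
    theta_tilde = rng.multivariate_normal(theta_hat, cov)
    return max(arms, key=lambda x: x @ theta_tilde)
```

The sketch reflects the trade-off stated in the abstract: d_lints_arm only needs a single linear-reward maximization over the arm set (oracle-efficient), whereas d_randlinucb_arm evaluates a randomized confidence-set index per arm.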