International Conference on Machine Learning

Exploration Through Reward Biasing: Reward-Biased Maximum Likelihood Estimation for Stochastic Multi-Armed Bandits

Abstract

Inspired by the Reward-Biased Maximum Likelihood Estimate method of adaptive control, we propose RBMLE - a novel family of learning algorithms for stochastic multi-armed bandits (SMABs). For a broad range of SMABs, including both the parametric Exponential Family and the non-parametric sub-Gaussian/Exponential family, we show that RBMLE yields an index policy. To choose the bias-growth rate α(t) in RBMLE, we reveal the nontrivial interplay between α(t) and the regret bound, which applies generally to both the Exponential Family and the sub-Gaussian/Exponential family bandits. To quantify the finite-time performance, we prove that RBMLE attains order-optimality for Gaussian and sub-Gaussian bandits by adaptively estimating the unknown constants in the expression of α(t). Extensive experiments demonstrate that the proposed RBMLE achieves empirical regret performance competitive with the state-of-the-art methods, while being more computationally efficient and scalable than the best-performing ones among them.
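For intuition, the sketch below (a hypothetical function rbmle_gaussian_bandit, not code from the paper) illustrates what such an index policy can look like in the simplest case of unit-variance Gaussian rewards: maximizing the reward-biased log-likelihood of an arm in closed form shifts its empirical mean upward by α(t)/N_i(t), which leads to the index μ̂_i(t) + α(t)/(2·N_i(t)). The particular α(t) used in the example call is only a placeholder; the paper's analysis ties the appropriate growth rate and its constants to the bandit family and estimates the unknown constants adaptively.

import numpy as np

def rbmle_gaussian_bandit(true_means, horizon, alpha, seed=0):
    """Sketch of an RBMLE-style index policy for unit-variance Gaussian bandits.

    true_means : sequence of (unknown) arm means, used here only to simulate rewards
    alpha      : callable returning the bias-growth rate alpha(t) at round t
    Returns the cumulative regret over `horizon` rounds.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)   # N_i(t): number of pulls of arm i so far
    sums = np.zeros(k)     # running sum of observed rewards per arm
    best = max(true_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1      # pull each arm once to initialize the estimates
        else:
            mu_hat = sums / counts
            # Maximizing the reward-biased Gaussian log-likelihood in closed
            # form shifts arm i's estimate by alpha(t)/N_i(t) and yields the
            # index  mu_hat_i + alpha(t) / (2 * N_i(t)).
            index = mu_hat + alpha(t) / (2.0 * counts)
            i = int(np.argmax(index))
        reward = rng.normal(true_means[i], 1.0)
        counts[i] += 1
        sums[i] += reward
        regret += best - true_means[i]
    return regret

# Example run with a slowly growing, purely illustrative alpha(t).
print(rbmle_gaussian_bandit([0.2, 0.5, 0.9], horizon=10_000,
                            alpha=lambda t: np.log(t + 1.0)))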