International Conference on Machine Learning

Exploration Through Reward Biasing: Reward-Biased Maximum Likelihood Estimation for Stochastic Multi-Armed Bandits

Abstract

Inspired by the Reward-Biased Maximum Likelihood Estimate method of adaptive control, we propose RBMLE - a novel family of learning algorithms for stochastic multi-armed bandits (SMABs). For a broad range of SMABs, including both the parametric Exponential Family and the non-parametric sub-Gaussian/Exponential family, we show that RBMLE yields an index policy. To choose the bias-growth rate α(t) in RBMLE, we reveal the nontrivial interplay between α(t) and the regret bound, which applies generally to both the Exponential Family and the sub-Gaussian/Exponential family bandits. To quantify the finite-time performance, we prove that RBMLE attains order-optimality for Gaussian and sub-Gaussian bandits by adaptively estimating the unknown constants in the expression of α(t). Extensive experiments demonstrate that the proposed RBMLE achieves empirical regret performance competitive with the state-of-the-art methods, while being more computationally efficient and scalable than the best-performing ones among them.
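For intuition, the sketch below (a hypothetical function rbmle_gaussian_bandit, not code from the paper) illustrates what such an index policy can look like in the simplest case of unit-variance Gaussian rewards: maximizing the reward-biased log-likelihood of an arm in closed form shifts its empirical mean upward by α(t)/N_i(t), which leads to the index μ̂_i(t) + α(t)/(2·N_i(t)). The particular α(t) used in the example call is only a placeholder; the paper's analysis ties the appropriate growth rate and its constants to the bandit family and estimates the unknown constants adaptively.

import numpy as np

def rbmle_gaussian_bandit(true_means, horizon, alpha, seed=0):
    """Sketch of an RBMLE-style index policy for unit-variance Gaussian bandits.

    true_means : sequence of (unknown) arm means, used here only to simulate rewards
    alpha      : callable returning the bias-growth rate alpha(t) at round t
    Returns the cumulative regret over `horizon` rounds.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)   # N_i(t): number of pulls of arm i so far
    sums = np.zeros(k)     # running sum of observed rewards per arm
    best = max(true_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1      # pull each arm once to initialize the estimates
        else:
            mu_hat = sums / counts
            # Maximizing the reward-biased Gaussian log-likelihood in closed
            # form shifts arm i's estimate by alpha(t)/N_i(t) and yields the
            # index  mu_hat_i + alpha(t) / (2 * N_i(t)).
            index = mu_hat + alpha(t) / (2.0 * counts)
            i = int(np.argmax(index))
        reward = rng.normal(true_means[i], 1.0)
        counts[i] += 1
        sums[i] += reward
        regret += best - true_means[i]
    return regret

# Example run with a slowly growing, purely illustrative alpha(t).
print(rbmle_gaussian_bandit([0.2, 0.5, 0.9], horizon=10_000,
                            alpha=lambda t: np.log(t + 1.0)))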