On a Non-asymptotic Analysis Using Large Deviation Principles in the Multiarmed Bandit Problem

Junya HONDA; Akimichi TAKEMURA


Abstract

The multiarmed bandit problem formalizes the tradeoff between exploration and exploitation in reinforcement learning as a model of a gambler choosing among multiple slot machines (arms). In this paper we first consider a stochastic bandit in which each arm has a reward distribution supported in a known interval, e.g., [0,1]. The DMED policy was recently proposed and proved to achieve the asymptotic bound for this model. However, the derived regret bound is stated in an asymptotic form, and the finite-time performance has been unknown. We inspect this policy and derive a finite-time regret bound by refining the large deviation probabilities for the behavior of the KL divergence into a simple non-asymptotic form. This analysis further reveals that the assumption that the support is bounded from below is not essential and can be replaced with a weaker one, the existence of the moment generating function.
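As a rough illustration of the kind of asymptotic bound the abstract refers to, the sketch below covers only the special case of Bernoulli rewards, where the classical Lai-Robbins lower bound says regret grows at least like the sum over suboptimal arms of gap / KL(arm, best arm), times log n. The paper itself treats general distributions on [0,1], where the KL term becomes a minimization over distributions, and that is not implemented here. This is not the DMED policy. The arm means and the function names are hypothetical and chosen for illustration.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence KL(Ber(p) || Ber(q)) in nats, with 0*log(0) taken as 0."""
    eps = 1e-12
    q = min(max(q, eps), 1.0 - eps)  # clamp q to avoid division by zero
    kl = 0.0
    if p > 0.0:
        kl += p * math.log(p / q)
    if p < 1.0:
        kl += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return kl


def asymptotic_regret_constant(means):
    """Sum over suboptimal arms of (best mean - arm mean) / KL(arm || best)."""
    best = max(means)
    return sum((best - m) / bernoulli_kl(m, best) for m in means if m < best)


if __name__ == "__main__":
    means = [0.5, 0.4, 0.3]  # hypothetical arm means, assumed for this example
    c = asymptotic_regret_constant(means)
    for n in (10**3, 10**5):
        # Leading-order term of the asymptotic lower bound after n plays
        print(n, c * math.log(n))
```

The constant is asymptotic in nature: it describes the leading log n term only, which is why a finite-time bound of the kind the abstract describes carries additional information.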


Bibliographic information

Source
電子情報通信学会技術研究報告 (IEICE Technical Report), 2012, No. 83, pp. 65-72 (8 pages)
Authors
Junya HONDA; Akimichi TAKEMURA
Affiliations

Graduate School of Frontier Sciences, The University of Tokyo, Kashiwanoha 5-1-5, Kashiwa-shi, Chiba, 277-8561, Japan

Graduate School of Information Science and Technology, The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo, 113-8656, Japan

Original format: PDF
Language: English
Keywords
multiarmed bandit problem; reinforcement learning; large deviation principle; moment generating function

Date added to database: 2022-08-18 00:29:09

