BioSystems

Overtaking method based on sand-sifter mechanism: Why do optimistic value functions find optimal solutions in multi-armed bandit problems?



Abstract

A multi-armed bandit problem is a search problem in which a learning agent must select the optimal arm among multiple slot machines that generate random rewards. The UCB algorithm is one of the most popular methods for solving multi-armed bandit problems; it achieves logarithmic regret by balancing exploration and exploitation. Since the introduction of UCB algorithms, researchers have known empirically that optimistic value functions perform well in multi-armed bandit problems. The terms "optimistic" and "optimism" suggest that the value function is sufficiently larger than the sample mean of the rewards. The original definition of the UCB algorithm, however, focuses on regret optimization and is not directly based on the optimism of a value function, so we need to ask why optimism yields good performance in multi-armed bandit problems. In the present article, we propose a new method, called the Overtaking method, for solving multi-armed bandit problems. The value function of the proposed method is defined as the upper bound of a confidence interval for an estimator of the expected reward: the value function asymptotically approaches the expected reward from above. If the value function remains larger than the expected value along this asymptote, then the learning agent is almost surely able to identify the optimal arm. This structure is called the sand-sifter mechanism, in which the value functions of suboptimal arms never regrow, meaning that the learning agent plays only the current best arm at each time step. Consequently, the proposed method achieves a high accuracy rate and low regret, and some of its value functions outperform UCB algorithms. This study illustrates, within one of the simplest frameworks, the advantage of agent optimism in uncertain environments. (C) 2015 The Authors. Published by Elsevier Ireland Ltd.
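To make the baseline concrete, the sketch below simulates the standard UCB1 value function on Bernoulli arms: each arm's value is its sample mean plus an exploration bonus, and the agent always plays the arm with the largest value. The abstract does not give the exact form of the Overtaking value function, so this only illustrates the optimistic upper-confidence structure under discussion; the function name, arm means, and horizon are illustrative assumptions, not part of the paper.

```python
import math
import random

def ucb1_bandit(arm_means, horizon, seed=0):
    """Minimal UCB1 simulation on Bernoulli arms (illustrative sketch).

    Each arm's value is its sample mean plus the exploration bonus
    sqrt(2 ln t / n_i); the agent plays the arm with the largest value.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms       # number of pulls per arm
    sums = [0.0] * n_arms       # cumulative reward per arm
    regret = 0.0
    best_mean = max(arm_means)

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1         # play each arm once to initialize
        else:
            # Optimistic value: upper confidence bound on the expected reward
            values = [
                sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
                for i in range(n_arms)
            ]
            arm = max(range(n_arms), key=lambda i: values[i])
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - arm_means[arm]   # pseudo-regret per step
    return counts, regret

if __name__ == "__main__":
    counts, regret = ucb1_bandit([0.4, 0.5, 0.7], horizon=10000)
    print("pulls per arm:", counts, "cumulative regret:", round(regret, 2))
```

In the Overtaking method, by contrast, the value function is an upper confidence bound that decays toward the expected reward from above, so the values of suboptimal arms are sifted out without regrowing and the agent keeps playing the current best arm.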