BioSystems

Overtaking method based on sand-sifter mechanism: Why do optimistic value functions find optimal solutions in multi-armed bandit problems?



Abstract

A multi-armed bandit problem is a search problem in which a learning agent must select the optimal arm among multiple slot machines that generate random rewards. The UCB algorithm is one of the most popular methods for solving multi-armed bandit problems; it achieves logarithmic regret by balancing exploration and exploitation. Since the introduction of UCB algorithms, researchers have known empirically that optimistic value functions perform well in multi-armed bandit problems. The terms "optimistic" and "optimism" suggest that the value function is sufficiently larger than the sample mean of the rewards. The original definition of the UCB algorithm, however, focuses on regret optimization and is not directly based on the optimism of a value function, so we need to ask why optimism yields good performance in multi-armed bandit problems. In the present article, we propose a new method, called the Overtaking method, for solving multi-armed bandit problems. The value function of the proposed method is defined as the upper bound of a confidence interval for an estimator of the expected reward: the value function asymptotically approaches the expected reward from above. If the value function remains larger than the expected value along this asymptote, then the learning agent is almost surely able to identify the optimal arm. This structure is called the sand-sifter mechanism, in which the value functions of suboptimal arms never regrow, meaning that the learning agent plays only the current best arm at each time step. Consequently, the proposed method achieves a high accuracy rate and low regret, and some of its value functions outperform UCB algorithms. This study illustrates, within one of the simplest frameworks, the advantage of agent optimism in uncertain environments. (C) 2015 The Authors. Published by Elsevier Ireland Ltd.
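To make the baseline concrete, the sketch below simulates the standard UCB1 value function on Bernoulli arms: each arm's value is its sample mean plus an exploration bonus, and the agent always plays the arm with the largest value. The abstract does not give the exact form of the Overtaking value function, so this only illustrates the optimistic upper-confidence structure under discussion; the function name, arm means, and horizon are illustrative assumptions, not part of the paper.

```python
import math
import random

def ucb1_bandit(arm_means, horizon, seed=0):
    """Minimal UCB1 simulation on Bernoulli arms (illustrative sketch).

    Each arm's value is its sample mean plus the exploration bonus
    sqrt(2 ln t / n_i); the agent plays the arm with the largest value.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms       # number of pulls per arm
    sums = [0.0] * n_arms       # cumulative reward per arm
    regret = 0.0
    best_mean = max(arm_means)

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1         # play each arm once to initialize
        else:
            # Optimistic value: upper confidence bound on the expected reward
            values = [
                sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
                for i in range(n_arms)
            ]
            arm = max(range(n_arms), key=lambda i: values[i])
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - arm_means[arm]   # pseudo-regret per step
    return counts, regret

if __name__ == "__main__":
    counts, regret = ucb1_bandit([0.4, 0.5, 0.7], horizon=10000)
    print("pulls per arm:", counts, "cumulative regret:", round(regret, 2))
```

In the Overtaking method, by contrast, the value function is an upper confidence bound that decays toward the expected reward from above, so the values of suboptimal arms are sifted out without regrowing and the agent keeps playing the current best arm.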