Journal of Machine Learning Research

Thompson Sampling Guided Stochastic Searching on the Line for Deceptive Environments with Applications to Root-Finding Problems


Abstract

The multi-armed bandit problem forms the foundation for solving a wide range of online stochastic optimization problems through a simple, yet effective mechanism. One simply casts the problem as a gambler who repeatedly pulls one out of $N$ slot machine arms, eliciting random rewards. Learning of reward probabilities is then combined with reward maximization, by carefully balancing reward exploration against reward exploitation. In this paper, we address a particularly intriguing variant of the multi-armed bandit problem, referred to as the Stochastic Point Location (SPL) problem. The gambler is here only told whether the optimal arm (point) lies to the “left” or to the “right” of the arm pulled, with the feedback being erroneous with probability $1-\pi$. This formulation thus targets optimization in continuous action spaces with both informative and deceptive feedback. To tackle this class of problems, we formulate a compact and scalable Bayesian representation of the solution space that simultaneously captures both the location of the optimal arm as well as the probability of receiving correct feedback. We further introduce the accompanying Thompson Sampling guided Stochastic Point Location (TS-SPL) scheme for balancing exploration against exploitation. By learning $\pi$, TS-SPL also supports deceptive environments that are lying about the direction of the optimal arm. This, in turn, allows us to address the fundamental Stochastic Root Finding (SRF) problem. Empirical results demonstrate that our scheme deals with both deceptive and informative environments, significantly outperforming competing algorithms both for SRF and SPL.