Journal of Machine Learning Research

Bandit Convex Optimization in Non-stationary Environments


Abstract

Bandit Convex Optimization (BCO) is a fundamental framework for modeling sequential decision-making with partial information, where the only feedback available to the player is the function value at one or two points. In this paper, we investigate BCO in non-stationary environments and choose the dynamic regret as the performance measure, which is defined as the difference between the cumulative loss incurred by the algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length of the comparator sequence, which reflects the non-stationarity of the environment. We propose a novel algorithm that achieves $O(T^{3/4}(1+P_T)^{1/2})$ and $O(T^{1/2}(1+P_T)^{1/2})$ dynamic regret for the one-point and two-point feedback models, respectively. The latter result is optimal, matching the $\Omega(T^{1/2}(1+P_T)^{1/2})$ lower bound established in this paper. Notably, our algorithm is adaptive to non-stationary environments, since it does not require prior knowledge of the path-length $P_T$, which is generally unknown. We further extend the algorithm to an anytime version that does not require knowing the time horizon $T$ in advance. Moreover, we study the adaptive regret, another widely used performance measure for online learning in non-stationary environments, and design an algorithm with provable adaptive regret guarantees for BCO problems. Finally, we present empirical studies to validate the effectiveness of the proposed approach.
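
The quantities named in the abstract can be made concrete with a small sketch. It is not the paper's algorithm. It only assumes the usual definitions, which the abstract does not spell out: dynamic regret $\sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(u_t)$ for plays $x_t$ and comparators $u_t$, path-length $P_T = \sum_{t=2}^T \|u_{t-1} - u_t\|_2$, and the standard two-point gradient estimator built from the queries $f(x + \delta s)$ and $f(x - \delta s)$ along a unit direction $s$. All function names and the parameter `delta` are hypothetical and chosen for illustration.

```python
import math
import random


def path_length(comparators):
    """P_T: sum of Euclidean distances between consecutive comparators (assumed definition)."""
    return sum(math.dist(a, b) for a, b in zip(comparators, comparators[1:]))


def dynamic_regret(losses, plays, comparators):
    """Cumulative loss of the plays minus that of the comparator sequence."""
    return sum(f(x) - f(u) for f, x, u in zip(losses, plays, comparators))


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere by normalizing Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def two_point_gradient(f, x, delta, direction):
    """Gradient estimate that uses only two function values (two-point feedback)."""
    d = len(x)
    plus = f([xi + delta * si for xi, si in zip(x, direction)])
    minus = f([xi - delta * si for xi, si in zip(x, direction)])
    scale = d * (plus - minus) / (2.0 * delta)
    return [scale * si for si in direction]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical drifting quadratic losses f_t(x) = (x - t/10)^2 in one dimension.
    losses = [lambda x, c=t / 10: (x[0] - c) ** 2 for t in range(5)]
    comparators = [[t / 10] for t in range(5)]  # per-round minimizers
    plays = [[0.0] for _ in range(5)]           # a player that never moves
    print("P_T =", path_length(comparators))
    print("dynamic regret =", dynamic_regret(losses, plays, comparators))
    s = random_unit_vector(1, rng)
    print("gradient estimate at 0:", two_point_gradient(losses[4], [0.0], 0.01, s))
```

A larger $P_T$ means the comparator sequence moves more, which is why the regret bounds in the abstract grow with $(1+P_T)^{1/2}$. The one-point model would have only a single query per round available, so the difference in `two_point_gradient` could not be formed there.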
