
Algorithms for bandit online linear optimization.

Abstract

The well-known "multi-armed bandit" problem is a problem of iterated decision making in a largely unknown and constantly changing environment. A decision-maker must repeatedly select one of k decisions and incur a corresponding loss, with the eventual goal of minimizing the "regret," defined as the difference between the actual total loss of the online decision-making procedure and that of the best single decision in hindsight. This problem has been well studied, and the best algorithms achieve the optimal regret of √(kT) (ignoring polylogarithmic factors), where T is the number of rounds for which the game is played.

We study the problem of "bandit online linear optimization," a generalization of the multi-armed bandit problem to a setting in which the decision set is a compact subset of R^n and the loss functions to be optimized are linear. Although the algorithms from the multi-armed bandit setting can be applied directly here, their regret depends on the size of the decision set, which may be very large or even infinite. The aim, rather, is to find algorithms whose regret is controlled by the ambient dimension n. "Bandit" refers to the setting in which the algorithm is told only its loss on each round, rather than the entire loss function. Prior to our work, algorithms for the bandit setting with bounded linear loss functions achieved regret bounds polynomial in n only by sacrificing their dependence on the number of rounds, the best previous bounds being of order poly(n)·T^(2/3). We bridge this gap by presenting an algorithm with expected regret at most n^(3/2)·√T (ignoring polylogarithmic factors). We also present a modification of our algorithm which achieves the same regret bound with high probability.

We provide lower bounds showing that no algorithm for bandit online linear optimization can achieve regret smaller than Ω(n·√T). An easy corollary of earlier work of Freund and Schapire shows that in the full-information setting (i.e., when the algorithm observes the entire loss vector each round) a regret bound of O(√(nT)) is possible. Thus we show that for online linear optimization the price of bandit information, i.e., the extra regret factor due to not observing the entire loss function, is between √n and n.

Another variant of the problem has the loss functions drawn from some fixed but unknown distribution. For this case we present a deterministic algorithm which achieves regret O(n·√T) in general. In the special case when the decision set is a fixed polytope, this algorithm achieves regret O(n^2·log^3 T). We also give a lower bound of Ω(√(nT)) on the regret for general compact decision sets.
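To make the interaction protocol and the regret definition concrete, the following is a minimal sketch (not taken from the dissertation) of one run of bandit online linear optimization over a toy finite decision set. The names `adversary_loss` and `player_choice`, the simplex-vertex decision set, and the uniformly random placeholder strategy are illustrative assumptions; a real algorithm such as those in the dissertation would replace `player_choice` with an exploration/exploitation rule driven by the observed scalar losses.

```python
import numpy as np

# Toy illustration of the bandit online linear optimization protocol:
# on each round the player commits to a point x_t in a compact set K,
# the adversary fixes a loss vector c_t, and the player observes ONLY
# the scalar loss <c_t, x_t> -- never the full vector c_t.

rng = np.random.default_rng(0)

n, T = 3, 1000
K = np.array([[1.0, 0.0, 0.0],   # decision set: vertices of the simplex
              [0.0, 1.0, 0.0],   # (any compact subset of R^n works)
              [0.0, 0.0, 1.0]])

def adversary_loss(t):
    """Loss vector c_t for round t (hidden from the player)."""
    return rng.uniform(0.0, 1.0, size=n)

def player_choice(history):
    """Placeholder strategy: pick a point of K uniformly at random.
    A real bandit algorithm would use the (x_t, observed loss) pairs
    in `history` to balance exploration and exploitation."""
    return K[rng.integers(len(K))]

history = []        # (x_t, observed scalar loss) pairs
loss_vectors = []   # kept here only so we can compute regret afterwards
total_loss = 0.0

for t in range(T):
    c_t = adversary_loss(t)
    x_t = player_choice(history)
    observed = float(c_t @ x_t)   # the only feedback the player receives
    history.append((x_t, observed))
    loss_vectors.append(c_t)
    total_loss += observed

# Regret: the player's total loss minus that of the best single point
# in hindsight, i.e. the fixed x in K minimizing sum_t <c_t, x>.
C = np.sum(loss_vectors, axis=0)
best_fixed_loss = min(float(C @ x) for x in K)
regret = total_loss - best_fixed_loss
print(f"total loss {total_loss:.1f}, best fixed {best_fixed_loss:.1f}, regret {regret:.1f}")
```

The point of the sketch is the feedback model: the player sees only the scalar `observed` each round, yet its regret is measured against the best fixed point in K chosen with full hindsight. The dissertation's contribution is a strategy whose regret grows only as n^(3/2)·√T despite this limited feedback.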

Bibliographic details

  • Author

    Dani, Varsha.

  • Affiliation

    The University of Chicago.

  • Degree-granting institution: The University of Chicago.
  • Subject: Computer Science.
  • Degree: Ph.D.
  • Year: 2008
  • Pagination: 91 p.
  • Total pages: 91
  • Format: PDF
  • Language: eng
  • Classification (CLC): Automation and computer technology
  • Keywords
