The well-known "multi-armed bandit" problem is a problem of iterated decision making in a largely unknown and constantly changing environment. A decision-maker must repeatedly select one of k decisions and incur a corresponding loss, with the eventual goal of minimizing the "regret," defined as the difference between the total loss incurred by the online decision-making procedure and that of the best single decision in hindsight. This problem has been well studied, and the best algorithms achieve the optimal regret of √(kT) (ignoring polylogarithmic factors), where T is the number of rounds for which the game is played.

We study the problem of "bandit online linear optimization," a generalization of the multi-armed bandit problem to a setting in which the decision set is a compact subset of R^n and the loss functions to be optimized are linear. Although the algorithms from the multi-armed bandit setting can be applied here directly, their regret depends on the size of the decision set, which may be very large or even infinite. The aim, rather, is to find algorithms whose regret is controlled by the ambient dimension n. "Bandit" refers to the setting in which the algorithm is only told its loss on each round, rather than the entire loss function. Prior to our work, algorithms for the bandit setting with bounded linear loss functions achieved regret polynomial in n only by sacrificing their dependence on the number of rounds, the best previous bounds being of order poly(n)·T^(2/3). We bridge this gap by presenting an algorithm whose expected regret is at most n^(3/2)·√T (ignoring polylogarithmic factors). We also present a modification of our algorithm which achieves the same regret bound with high probability.

We provide lower bounds showing that no algorithm for bandit online linear optimization can achieve regret lower than Ω(n·√T).
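To make the regret notion concrete for the k-armed case: the classic EXP3 algorithm of Auer et al. attains the √(kT) rate up to logarithmic factors. Below is a minimal Python sketch of its loss-based variant; the function names and the toy loss instance are illustrative choices of ours, not part of the original work.

```python
import math
import random

def exp3(k, T, losses, gamma):
    """Loss-based EXP3 sketch.

    losses: function (round t, arm) -> loss in [0, 1], revealed only for
            the arm actually played (the bandit feedback model).
    gamma:  exploration rate, typically ~ sqrt(k log k / T).
    Returns the total loss incurred over T rounds.
    """
    weights = [1.0] * k
    total_loss = 0.0
    for t in range(T):
        w_sum = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1 - gamma) * w / w_sum + gamma / k for w in weights]
        arm = random.choices(range(k), weights=probs)[0]
        loss = losses(t, arm)
        total_loss += loss
        # Importance-weighted loss estimate: only the played arm is updated,
        # which keeps the estimate unbiased under bandit feedback.
        est = loss / probs[arm]
        weights[arm] *= math.exp(-gamma * est / k)
    return total_loss
```

On a toy instance where one arm always has loss 0 and the others loss 1, the regret (total loss minus the best fixed arm's loss, here 0) grows on the order of √(kT log k) rather than linearly in T.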
An easy corollary of earlier work of Freund and Schapire shows that in the full-information setting (i.e., when the algorithm observes the entire loss vector each round), a regret bound of O(√(nT)) is possible. Thus we show that for online linear optimization, the price of bandit information, i.e., the extra regret factor due to not observing the entire loss function, is between √n and n.

Another variant of the problem has the loss functions drawn from some fixed but unknown distribution. For this case we present a deterministic algorithm which achieves regret O(n·√T) in general. In the special case when the decision set is a fixed polytope, this algorithm achieves regret O(n^2 log^3 T). We also give a lower bound of Ω(√(nT)) on the regret for general compact decision sets.
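The √n-to-n range quoted above follows directly by dividing the bandit-setting bounds by the full-information rate (suppressing polylogarithmic factors):

```latex
\[
\frac{\Omega(n\sqrt{T})}{O(\sqrt{nT})} = \sqrt{n}
\qquad\text{and}\qquad
\frac{O(n^{3/2}\sqrt{T})}{O(\sqrt{nT})} = n,
\]
```

so the extra factor paid for bandit feedback is at least √n and at most n.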