We initiate the study of learning in contextual bandits with the help of loss predictors. The main question we address is whether one can improve over the minimax regret $mathcal{O}(sqrt{T})$ for learning over $T$ rounds, when the total error of the predicted losses relative to the realized losses, denoted as $mathcal{E} leq T$, is relatively small. We provide a complete answer to this question, with upper and lower bounds for various settings: adversarial and stochastic environments, known and unknown $mathcal{E}$, and single and multiple predictors. We show several surprising results, such as 1) the optimal regret is $mathcal{O}(min{sqrt{T}, sqrt{mathcal{E}}T^rac{1}{4}})$ when $mathcal{E}$ is known, in contrast to the standard and better bound $mathcal{O}(sqrt{mathcal{E}})$ for non-contextual problems (such as multi-armed bandits); 2) the same bound cannot be achieved if $mathcal{E}$ is unknown, but as a remedy, $mathcal{O}(sqrt{mathcal{E}}T^rac{1}{3})$ is achievable; 3) with $M$ predictors, a linear dependence on $M$ is necessary, even though logarithmic dependence is possible for non-contextual problems. We also develop several novel algorithmic techniques to achieve matching upper bounds, including 1) a key emph{action remapping} technique for optimal regret with known $mathcal{E}$, 2) computationally efficient implementation of Catoni’s robust mean estimator via an ERM oracle in the stochastic setting with optimal regret, 3) an underestimator for $mathcal{E}$ via estimating the histogram with bins of exponentially increasing size for the stochastic setting with unknown $mathcal{E}$, and 4) a self-referential scheme for learning with multiple predictors, all of which might be of independent interest.
展开▼