In the stochastic multi-armed bandit problem we consider a modification of the UCB algorithm of Auer et al. [4]. For this modified algorithm we give an improved bound on the regret with respect to the optimal reward. While for the original UCB algorithm the regret in K-armed bandits after T trials is bounded by const · $\frac{K\log T}{\Delta}$, where Δ measures the distance between a suboptimal arm and the optimal arm, for the modified UCB algorithm we show an upper bound on the regret of const · $\frac{K\log(T\Delta^2)}{\Delta}$.
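For context, the index rule of the original UCB algorithm (UCB1) referenced above can be sketched as follows. This is a minimal illustration, not the modified algorithm analyzed in the paper; the arm reward distributions and the horizon are assumptions for the example.

```python
import math
import random

def ucb1(reward_fns, horizon, seed=0):
    """Minimal UCB1 sketch: play the arm maximizing the index
    empirical mean + sqrt(2 ln t / n_i), after one initial pull per arm.

    reward_fns: list of zero-argument callables returning a reward in [0, 1]
    horizon:    total number of trials T
    Returns per-arm pull counts and reward sums.
    """
    random.seed(seed)
    k = len(reward_fns)
    counts = [0] * k    # n_i: number of pulls of arm i
    sums = [0.0] * k    # cumulative reward of arm i
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: play each arm once
        else:
            # pick the arm with the largest upper confidence index
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        counts[arm] += 1
        sums[arm] += reward_fns[arm]()
    return counts, sums
```

With two hypothetical Bernoulli arms of means 0.9 and 0.1 (so Δ = 0.8), the index concentrates pulls on the better arm, and the suboptimal arm is pulled only O(log T / Δ²) times, which is where the K log T / Δ regret bound comes from.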