Source: JMLR: Workshop and Conference Proceedings

Semiparametric Contextual Bandits



Abstract

This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for a chosen action is modeled as a linear function of known action features confounded by a non-linear action-independent term. We design new algorithms that achieve $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is $d$-dimensional, which matches the best known bounds for the simpler unconfounded case and improves on a recent result of Greenewald et al. (2017). Via an empirical evaluation, we show that our algorithms outperform prior approaches when there are non-linear confounding effects on the rewards. Technically, our algorithms use a new reward estimator inspired by doubly-robust approaches and our proofs require new concentration inequalities for self-normalized martingales.
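The reward model described in the abstract (a linear term in the action features plus a non-linear term shared by all actions) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's algorithm: the uniform-random action choice, the `tanh` confounder, the Gaussian features, and the function name `simulate_centered_estimate` are all hypothetical choices made for illustration. The centered estimator subtracts the policy's mean feature vector from the chosen feature vector, which is one simple way to cancel an action-independent term in expectation.

```python
import numpy as np


def simulate_centered_estimate(T=5000, d=3, K=4, seed=0):
    """Simulate r_t = <theta, z_{t,a_t}> + f_t + noise and estimate theta two ways.

    All modelling choices here (uniform policy, tanh confounder, Gaussian
    features) are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d)
    theta /= np.linalg.norm(theta)  # unit-norm true parameter

    z_chosen = np.empty((T, d))
    z_centered = np.empty((T, d))
    rewards = np.empty(T)
    for t in range(T):
        Z = rng.normal(size=(K, d))        # features of the K actions this round
        confounder = 5.0 * np.tanh(Z.sum())  # non-linear, identical for every action
        a = rng.integers(K)                # uniform-random action (assumed policy)
        mu = Z.mean(axis=0)                # mean feature vector under that policy
        rewards[t] = Z[a] @ theta + confounder + 0.1 * rng.normal()
        z_chosen[t] = Z[a]
        z_centered[t] = Z[a] - mu

    # Ordinary least squares on the raw features ignores the confounder.
    theta_ols = np.linalg.lstsq(z_chosen, rewards, rcond=None)[0]

    # Centered estimator: solve Gamma x = sum_t (z - mu) r_t,
    # with Gamma = sum_t (z - mu)(z - mu)^T.
    gamma = z_centered.T @ z_centered
    theta_centered = np.linalg.solve(gamma, z_centered.T @ rewards)
    return theta, theta_ols, theta_centered


if __name__ == "__main__":
    theta, theta_ols, theta_centered = simulate_centered_estimate()
    print("OLS error:     ", np.linalg.norm(theta_ols - theta))
    print("centered error:", np.linalg.norm(theta_centered - theta))
```

In this setup the confounder is correlated with the chosen action's features, so the plain least-squares fit is biased, while the centered features are uncorrelated with the confounder and the second estimate should land much closer to the true parameter.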
