Journal of the American Statistical Association

Statistical Inference for Online Decision Making: In a Contextual Bandit Setting



Abstract

The online decision-making problem requires us to make a sequence of decisions based on incremental information. Common solutions typically need to learn a reward model of the different actions given the contextual information and then maximize the long-term reward. It is meaningful to know whether the posited model is reasonable and how the model performs in the asymptotic sense. We study this problem under the setup of the contextual bandit framework with a linear reward model. The epsilon-greedy policy is adopted to address the classic exploration-exploitation dilemma. Using the martingale central limit theorem, we show that the online ordinary least squares estimator of the model parameters is asymptotically normal. When the linear model is misspecified, we propose the online weighted least squares estimator using inverse propensity score weighting and also establish its asymptotic normality. Based on the properties of the parameter estimators, we further show that the in-sample inverse propensity weighted value estimator is asymptotically normal. We illustrate our results using simulations and an application to a news article recommendation dataset from Yahoo!. Supplementary materials for this article are available online.
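The following is a minimal sketch (not the authors' code) of the setup the abstract describes: an epsilon-greedy contextual bandit with a linear reward model, a per-arm online ordinary least squares estimator maintained through its sufficient statistics, and the in-sample inverse-propensity-weighted value estimate V_hat = (1/T) * sum_t 1{a_t = greedy_t(x_t)} * r_t / p_t, where p_t is the propensity of the chosen arm under the epsilon-greedy policy. The synthetic data generator, the arm count K, and eps = 0.1 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic setup: K arms, d-dimensional contexts,
# true linear reward model r = x' beta_a + noise.
K, d, T, eps = 2, 3, 5000, 0.1
beta_true = rng.normal(size=(K, d))

# Per-arm sufficient statistics for the online OLS estimator
# beta_hat_a = (X_a' X_a)^{-1} X_a' y_a, updated incrementally.
XtX = np.stack([np.eye(d) * 1e-6 for _ in range(K)])  # tiny ridge for invertibility
Xty = np.zeros((K, d))

ipw_terms = []  # terms of the in-sample IPW value estimate

for t in range(T):
    x = rng.normal(size=d)  # observe the context
    beta_hat = np.array([np.linalg.solve(XtX[a], Xty[a]) for a in range(K)])
    greedy = int(np.argmax(beta_hat @ x))  # exploit the current OLS estimates

    # epsilon-greedy: explore uniformly with probability eps, else act greedily
    a = int(rng.integers(K)) if rng.random() < eps else greedy
    # propensity of the chosen arm under the epsilon-greedy policy
    prop = (1 - eps) + eps / K if a == greedy else eps / K

    r = x @ beta_true[a] + rng.normal()  # observe the reward

    # online least squares update for the chosen arm
    XtX[a] += np.outer(x, x)
    Xty[a] += r * x

    # IPW value term: reweight the reward by 1/propensity whenever the
    # action taken matches the greedy action being evaluated
    ipw_terms.append((a == greedy) * r / prop)

beta_final = np.array([np.linalg.solve(XtX[a], Xty[a]) for a in range(K)])
print("IPW value estimate:", np.mean(ipw_terms))
print("online OLS estimates:\n", beta_final)
```

Under misspecification, the weighted variant proposed in the paper would reweight each least-squares update by the inverse propensity (i.e., multiply both the XtX and Xty increments by 1/prop); the sketch keeps the unweighted OLS update for clarity.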


