
Improved Algorithms for Conservative Exploration in Bandits

Abstract

In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that during the learning process the performance is almost never worse than the performance of the baseline itself. In this paper, we study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LIN-UCB (CLUCB2). We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems. Finally, we consider a more realistic constraint where the performance is verified only at predefined checkpoints (instead of at every step) and show how this relaxed constraint favorably impacts the regret and empirical performance of CLUCB2.
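
To make the conservative constraint described above concrete, here is a small hypothetical sketch of the generic conservative-exploration mechanism on a simple multi-armed bandit. This is not the paper's CLUCB2 algorithm (which operates in the contextual linear setting); the function name, the Bernoulli reward model, the UCB1-style indices, and the assumption that the baseline mean is known are all illustrative choices. The idea it shows: play the optimistic arm only when a pessimistic check guarantees that cumulative reward stays above a (1 - alpha) fraction of what the baseline would have earned; otherwise fall back to the baseline.

```python
import numpy as np

def conservative_ucb(means, baseline_arm, alpha=0.05, horizon=10_000, seed=0):
    """Sketch of conservative exploration on a K-armed Bernoulli bandit.

    Constraint tracked (realized form): sum of collected rewards
    >= (1 - alpha) * t * mu_baseline at every step t.
    """
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    mu_b = means[baseline_arm]   # baseline mean reward, assumed known in this sketch
    budget = 0.0                 # realized slack over the (1 - alpha) * t * mu_b requirement

    for t in range(1, horizon + 1):
        # UCB1-style optimistic index; unplayed arms get an infinite index.
        safe_counts = np.maximum(counts, 1)
        ucb = np.where(counts > 0,
                       sums / safe_counts + np.sqrt(2 * np.log(t) / safe_counts),
                       np.inf)
        candidate = int(np.argmax(ucb))

        # Pessimistic (lower-confidence) estimate of the candidate's mean.
        if counts[candidate] > 0:
            lcb = max(0.0, sums[candidate] / counts[candidate]
                      - np.sqrt(2 * np.log(t) / counts[candidate]))
        else:
            lcb = 0.0

        # Conservative check: even under the pessimistic estimate, playing the
        # candidate must keep the cumulative constraint satisfied.
        arm = candidate if budget + lcb - (1 - alpha) * mu_b >= 0 else baseline_arm

        reward = float(rng.random() < means[arm])   # Bernoulli reward draw
        counts[arm] += 1
        sums[arm] += reward
        budget += reward - (1 - alpha) * mu_b

    return sums.sum()


# Toy run: arm 1 is the reliable but suboptimal baseline, arm 2 is the best arm.
if __name__ == "__main__":
    total = conservative_ucb(means=[0.3, 0.5, 0.7], baseline_arm=1)
    print(f"total reward collected over the horizon: {total:.0f}")
```

Early on, the pessimistic check fails and the sketch keeps playing the baseline; as estimates sharpen and slack accumulates, it switches to the optimistic arm. The paper's checkpoint variant relaxes this by verifying the constraint only at predefined steps rather than at every round.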
