International Conference on Algorithmic Decision Theory

Interactive Thompson Sampling for Multi-objective Multi-armed Bandits



Abstract

In multi-objective reinforcement learning (MORL), much attention is paid to generating optimal solution sets for unknown user utility functions on the basis of stochastic reward vectors alone. In online MORL, on the other hand, the agent can often elicit preferences from the user, enabling it to learn the user's utility function directly. In this paper, we study online MORL with user interaction in the multi-objective multi-armed bandit (MOMAB) setting, perhaps the most fundamental MORL setting. We use Bayesian learning algorithms to learn about the environment and the user simultaneously. Specifically, we propose two algorithms, utility-MAP UCB (umap-UCB) and Interactive Thompson Sampling (ITS), and show empirically that their regret closely approximates that of UCB and regular Thompson sampling when these baselines are provided with the user's ground-truth utility function from the start, and that ITS outperforms umap-UCB.
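The interactive setting the abstract describes can be sketched in code. The following is a simplified illustration, not the paper's exact ITS algorithm: it assumes Gaussian vector rewards, a linear utility with unknown weights, noiseless pairwise comparisons from the user, and a Kalman-style Gaussian weight update; the function name, query schedule, and priors are all hypothetical choices for the sketch.

```python
import numpy as np

def interactive_thompson_sampling(true_means, true_weights, horizon=2000, seed=0):
    """Sketch of interactive Thompson sampling for a MOMAB problem.

    Maintains (a) a Gaussian posterior over each arm's mean reward vector and
    (b) a Gaussian posterior over the user's linear utility weights, updated
    from occasional pairwise-comparison queries. At each step, both posteriors
    are Thompson-sampled and the arm maximizing sampled utility is pulled.
    """
    rng = np.random.default_rng(seed)
    n_arms, n_obj = true_means.shape
    counts = np.zeros(n_arms)              # pulls per arm
    sums = np.zeros((n_arms, n_obj))       # summed reward vectors per arm
    w_mean, w_cov = np.zeros(n_obj), np.eye(n_obj)  # utility-weight posterior
    sigma2 = 0.25                          # assumed comparison-noise variance
    best_utility = float(np.max(true_means @ true_weights))
    regret = 0.0
    for t in range(horizon):
        # Thompson-sample arm means (unit observation noise assumed) and weights.
        post_mean = sums / np.maximum(counts, 1.0)[:, None]
        post_sd = np.sqrt(1.0 / (counts + 1.0))
        mu = post_mean + rng.standard_normal((n_arms, n_obj)) * post_sd[:, None]
        w = rng.multivariate_normal(w_mean, w_cov)
        arm = int(np.argmax(mu @ w))
        # Pull the arm and observe a noisy reward vector.
        reward = true_means[arm] + rng.standard_normal(n_obj)
        counts[arm] += 1.0
        sums[arm] += reward
        regret += best_utility - float(true_means[arm] @ true_weights)
        # Periodically query the user: which of two arms' estimates is better?
        if t % 50 == 0:
            a, b = rng.choice(n_arms, size=2, replace=False)
            d = post_mean[a] - post_mean[b]
            y = 1.0 if true_means[a] @ true_weights > true_means[b] @ true_weights else -1.0
            # Treat the answer as a noisy linear observation y ~ d.w + noise
            # and apply a rank-1 Kalman update to the weight posterior.
            s = float(d @ w_cov @ d) + sigma2
            k = (w_cov @ d) / s
            w_mean = w_mean + k * (y - float(d @ w_mean))
            w_cov = w_cov - np.outer(k, d @ w_cov)
    return regret, counts, w_mean

# Toy run: arm 0 is best under the (hidden) user weights.
true_means = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
true_w = np.array([0.9, 0.1])
regret, counts, w_mean = interactive_thompson_sampling(true_means, true_w, horizon=2000, seed=1)
```

Because comparison feedback pins down only the utility-relevant direction of the weights, the sampled weights stay uncertain along unqueried directions, yet arm selection converges once the preferred ordering is learned.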
