International Conference on Algorithmic Decision Theory

Interactive Thompson Sampling for Multi-objective Multi-armed Bandits


Abstract

In multi-objective reinforcement learning (MORL), much attention is paid to generating optimal solution sets for unknown utility functions of users, based on the stochastic reward vectors only. In online MORL, on the other hand, the agent can often elicit preferences from the user, enabling it to learn about its user's utility function directly. In this paper, we study online MORL with user interaction in the multi-objective multi-armed bandit (MOMAB) setting - perhaps the most fundamental MORL setting. We use Bayesian learning algorithms to learn about the environment and the user simultaneously. Specifically, we propose two algorithms: Utility-MAP UCB (umap-UCB) and Interactive Thompson Sampling (ITS), and show empirically that the regret of these algorithms closely approximates the regret of UCB and regular Thompson sampling provided with the ground-truth utility function of the user from the start, and that ITS outperforms umap-UCB.
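To make the setting concrete, the sketch below illustrates interactive Thompson sampling in a MOMAB. It is a minimal, hypothetical example, not the paper's exact ITS algorithm: it assumes a linear utility function with unknown weights, independent Gaussian posteriors over each arm's mean reward vector, a particle approximation of the posterior over the user's utility weights, and a simulated user who answers occasional pairwise comparisons. All names and constants (query_user, particles, the query interval) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, T = 5, 2, 2000                        # arms, objectives, rounds
true_means = rng.uniform(0.0, 1.0, (K, D))  # unknown mean reward vectors
true_w = np.array([0.7, 0.3])               # user's hidden utility weights
best_value = float(np.max(true_means @ true_w))


def query_user(v_a, v_b):
    """Simulated user: returns 0 if the first reward vector is preferred, else 1."""
    return 0 if true_w @ v_a >= true_w @ v_b else 1


# Posterior over each arm's mean reward vector: Gaussian with empirical
# mean and variance 1 / (n + 1) (unit observation noise, standard prior).
counts = np.zeros(K)
sums = np.zeros((K, D))

# Posterior over the user's utility weights: particles on the simplex.
particles = rng.dirichlet(np.ones(D), size=500)
p_weights = np.full(len(particles), 1.0 / len(particles))

regret = 0.0
for t in range(T):
    # 1. Thompson-sample a mean reward vector for every arm.
    post_mean = sums / np.maximum(counts, 1.0)[:, None]
    post_std = 1.0 / np.sqrt(counts + 1.0)
    sampled = post_mean + post_std[:, None] * rng.standard_normal((K, D))

    # 2. Thompson-sample a utility weight vector from the particle posterior.
    w = particles[rng.choice(len(particles), p=p_weights)]

    # 3. Play the arm that maximises the sampled scalarised utility.
    arm = int(np.argmax(sampled @ w))
    reward = true_means[arm] + 0.1 * rng.standard_normal(D)
    counts[arm] += 1
    sums[arm] += reward
    regret += best_value - true_means[arm] @ true_w

    # 4. Occasionally elicit a pairwise comparison and reweight the particles
    #    with a soft (sigmoid) likelihood of the observed preference.
    if t % 50 == 0:
        a, b = rng.choice(K, size=2, replace=False)
        pref = query_user(post_mean[a], post_mean[b])
        better, worse = (a, b) if pref == 0 else (b, a)
        margin = particles @ (post_mean[better] - post_mean[worse])
        p_weights = p_weights * (1.0 / (1.0 + np.exp(-10.0 * margin)))
        p_weights /= p_weights.sum()

print(f"average per-round regret after {T} rounds: {regret / T:.4f}")
```

The only differences from ordinary Thompson sampling are steps 2 and 4: the utility weights are treated as an additional unknown, sampled from their posterior each round, and that posterior is updated from the user's occasional preference queries rather than from the reward signal.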
