International Conference on Algorithmic Decision Theory

Interactive Thompson Sampling for Multi-objective Multi-armed Bandits


Abstract

In multi-objective reinforcement learning (MORL), much attention is paid to generating optimal solution sets for unknown utility functions of users, based on the stochastic reward vectors only. In online MORL, on the other hand, the agent can often elicit preferences from the user, enabling it to learn about its user's utility function directly. In this paper, we study online MORL with user interaction in the multi-objective multi-armed bandit (MOMAB) setting - perhaps the most fundamental MORL setting. We use Bayesian learning algorithms to learn about the environment and the user simultaneously. Specifically, we propose two algorithms: Utility-MAP UCB (umap-UCB) and Interactive Thompson Sampling (ITS), and show empirically that the regret of these algorithms closely approximates the regret of UCB and regular Thompson sampling provided with the ground-truth utility function of the user from the start, and that ITS outperforms umap-UCB.
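To make the setting concrete, the sketch below illustrates interactive Thompson sampling in a MOMAB. It is a minimal, hypothetical example, not the paper's exact ITS algorithm: it assumes a linear utility function with unknown weights, independent Gaussian posteriors over each arm's mean reward vector, a particle approximation of the posterior over the user's utility weights, and a simulated user who answers occasional pairwise comparisons. All names and constants (query_user, particles, the query interval) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, T = 5, 2, 2000                        # arms, objectives, rounds
true_means = rng.uniform(0.0, 1.0, (K, D))  # unknown mean reward vectors
true_w = np.array([0.7, 0.3])               # user's hidden utility weights
best_value = float(np.max(true_means @ true_w))


def query_user(v_a, v_b):
    """Simulated user: returns 0 if the first reward vector is preferred, else 1."""
    return 0 if true_w @ v_a >= true_w @ v_b else 1


# Posterior over each arm's mean reward vector: Gaussian with empirical
# mean and variance 1 / (n + 1) (unit observation noise, standard prior).
counts = np.zeros(K)
sums = np.zeros((K, D))

# Posterior over the user's utility weights: particles on the simplex.
particles = rng.dirichlet(np.ones(D), size=500)
p_weights = np.full(len(particles), 1.0 / len(particles))

regret = 0.0
for t in range(T):
    # 1. Thompson-sample a mean reward vector for every arm.
    post_mean = sums / np.maximum(counts, 1.0)[:, None]
    post_std = 1.0 / np.sqrt(counts + 1.0)
    sampled = post_mean + post_std[:, None] * rng.standard_normal((K, D))

    # 2. Thompson-sample a utility weight vector from the particle posterior.
    w = particles[rng.choice(len(particles), p=p_weights)]

    # 3. Play the arm that maximises the sampled scalarised utility.
    arm = int(np.argmax(sampled @ w))
    reward = true_means[arm] + 0.1 * rng.standard_normal(D)
    counts[arm] += 1
    sums[arm] += reward
    regret += best_value - true_means[arm] @ true_w

    # 4. Occasionally elicit a pairwise comparison and reweight the particles
    #    with a soft (sigmoid) likelihood of the observed preference.
    if t % 50 == 0:
        a, b = rng.choice(K, size=2, replace=False)
        pref = query_user(post_mean[a], post_mean[b])
        better, worse = (a, b) if pref == 0 else (b, a)
        margin = particles @ (post_mean[better] - post_mean[worse])
        p_weights = p_weights * (1.0 / (1.0 + np.exp(-10.0 * margin)))
        p_weights /= p_weights.sum()

print(f"average per-round regret after {T} rounds: {regret / T:.4f}")
```

The only differences from ordinary Thompson sampling are steps 2 and 4: the utility weights are treated as an additional unknown, sampled from their posterior each round, and that posterior is updated from the user's occasional preference queries rather than from the reward signal.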
