International Conference on Algorithmic Decision Theory

Interactive Thompson Sampling for Multi-objective Multi-armed Bandits



Abstract

In multi-objective reinforcement learning (MORL), much attention is paid to generating optimal solution sets for unknown user utility functions on the basis of stochastic reward vectors alone. In online MORL, on the other hand, the agent can often elicit preferences from the user, enabling it to learn the user's utility function directly. In this paper, we study online MORL with user interaction in the multi-objective multi-armed bandit (MOMAB) setting, perhaps the most fundamental MORL setting. We use Bayesian learning algorithms to learn about the environment and the user simultaneously. Specifically, we propose two algorithms, utility-MAP UCB (umap-UCB) and Interactive Thompson Sampling (ITS), and show empirically that their regret closely approximates that of UCB and regular Thompson sampling when these baselines are provided with the user's ground-truth utility function from the start, and that ITS outperforms umap-UCB.
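The interactive setting the abstract describes can be sketched in code. The following is a simplified illustration, not the paper's exact ITS algorithm: it assumes Gaussian vector rewards, a linear utility with unknown weights, noiseless pairwise comparisons from the user, and a Kalman-style Gaussian weight update; the function name, query schedule, and priors are all hypothetical choices for the sketch.

```python
import numpy as np

def interactive_thompson_sampling(true_means, true_weights, horizon=2000, seed=0):
    """Sketch of interactive Thompson sampling for a MOMAB problem.

    Maintains (a) a Gaussian posterior over each arm's mean reward vector and
    (b) a Gaussian posterior over the user's linear utility weights, updated
    from occasional pairwise-comparison queries. At each step, both posteriors
    are Thompson-sampled and the arm maximizing sampled utility is pulled.
    """
    rng = np.random.default_rng(seed)
    n_arms, n_obj = true_means.shape
    counts = np.zeros(n_arms)              # pulls per arm
    sums = np.zeros((n_arms, n_obj))       # summed reward vectors per arm
    w_mean, w_cov = np.zeros(n_obj), np.eye(n_obj)  # utility-weight posterior
    sigma2 = 0.25                          # assumed comparison-noise variance
    best_utility = float(np.max(true_means @ true_weights))
    regret = 0.0
    for t in range(horizon):
        # Thompson-sample arm means (unit observation noise assumed) and weights.
        post_mean = sums / np.maximum(counts, 1.0)[:, None]
        post_sd = np.sqrt(1.0 / (counts + 1.0))
        mu = post_mean + rng.standard_normal((n_arms, n_obj)) * post_sd[:, None]
        w = rng.multivariate_normal(w_mean, w_cov)
        arm = int(np.argmax(mu @ w))
        # Pull the arm and observe a noisy reward vector.
        reward = true_means[arm] + rng.standard_normal(n_obj)
        counts[arm] += 1.0
        sums[arm] += reward
        regret += best_utility - float(true_means[arm] @ true_weights)
        # Periodically query the user: which of two arms' estimates is better?
        if t % 50 == 0:
            a, b = rng.choice(n_arms, size=2, replace=False)
            d = post_mean[a] - post_mean[b]
            y = 1.0 if true_means[a] @ true_weights > true_means[b] @ true_weights else -1.0
            # Treat the answer as a noisy linear observation y ~ d.w + noise
            # and apply a rank-1 Kalman update to the weight posterior.
            s = float(d @ w_cov @ d) + sigma2
            k = (w_cov @ d) / s
            w_mean = w_mean + k * (y - float(d @ w_mean))
            w_cov = w_cov - np.outer(k, d @ w_cov)
    return regret, counts, w_mean

# Toy run: arm 0 is best under the (hidden) user weights.
true_means = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
true_w = np.array([0.9, 0.1])
regret, counts, w_mean = interactive_thompson_sampling(true_means, true_w, horizon=2000, seed=1)
```

Because comparison feedback pins down only the utility-relevant direction of the weights, the sampled weights stay uncertain along unqueried directions, yet arm selection converges once the preferred ordering is learned.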
