A Survey of Preference-Based Online Learning with Bandit Algorithms

Abstract

In machine learning, the notion of multi-armed bandits refers to a class of online learning problems in which an agent is supposed to simultaneously explore and exploit a given set of choice alternatives in the course of a sequential decision process. In the standard setting, the agent learns from stochastic feedback in the form of real-valued rewards. In many applications, however, numerical reward signals are not readily available; instead, only weaker information is provided, in particular relative preferences in the form of qualitative comparisons between pairs of alternatives. This observation has motivated the study of variants of the multi-armed bandit problem in which more general representations are used both for the type of feedback to learn from and for the target of prediction. The aim of this paper is to provide a survey of the state of the art in this field, which we refer to as preference-based multi-armed bandits. To this end, we provide an overview of the problems that have been considered in the literature as well as methods for tackling them. Our systematization is mainly based on the assumptions made by these methods about the data-generating process and, related to this, the properties of the preference-based feedback.
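
To make the feedback model concrete, the following is a minimal Python sketch of a dueling-bandit interaction loop in the spirit of UCB-style methods for preference-based bandits (such as RUCB), not an implementation of any specific algorithm from the survey. The preference matrix P, the horizon T, and the exploration constant alpha are assumptions made up for this demo; the deterministic champion/challenger selection is a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pairwise preference matrix (an assumption for this demo):
# P[i, j] is the probability that arm i wins a duel against arm j.
# Arm 0 is the Condorcet winner: it beats every other arm with
# probability greater than 1/2.
P = np.array([
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
])
K = P.shape[0]

wins = np.zeros((K, K))   # wins[i, j]: number of duels in which i beat j
plays = np.zeros((K, K))  # plays[i, j]: number of duels between i and j

T = 10_000
alpha = 0.51              # exploration constant for the optimism bonus

for t in range(1, T + 1):
    # Optimistic (UCB-style) estimates of the pairwise win probabilities.
    n = np.maximum(plays, 1)
    ucb = wins / n + np.sqrt(alpha * np.log(t + 1) / n)
    np.fill_diagonal(ucb, 0.5)

    # Champion: the arm whose worst optimistic matchup is best.
    i = int(np.argmax(ucb.min(axis=1)))

    # Challenger: the distinct arm most likely, optimistically, to beat i.
    challengers = ucb[:, i].copy()
    challengers[i] = -np.inf
    j = int(np.argmax(challengers))

    # The environment reveals only the qualitative outcome of the duel;
    # no numeric reward is ever observed.
    if rng.random() < P[i, j]:
        wins[i, j] += 1
    else:
        wins[j, i] += 1
    plays[i, j] += 1
    plays[j, i] += 1

print("duels per pair:\n", plays)
print("empirical win rates:\n", wins / np.maximum(plays, 1))
```

Note how the loop matches the abstract's setting: the learner chooses a pair of alternatives at each step and receives only a binary comparison outcome, so exploration and exploitation must both be driven by the empirical pairwise win statistics rather than by real-valued rewards.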
