A Survey of Preference-Based Online Learning with Bandit Algorithms

Abstract

In machine learning, the notion of multi-armed bandits refers to a class of online learning problems in which an agent is supposed to simultaneously explore and exploit a given set of choice alternatives in the course of a sequential decision process. In the standard setting, the agent learns from stochastic feedback in the form of real-valued rewards. In many applications, however, numerical reward signals are not readily available; instead, only weaker information is provided, in particular relative preferences in the form of qualitative comparisons between pairs of alternatives. This observation has motivated the study of variants of the multi-armed bandit problem in which more general representations are used both for the type of feedback to learn from and for the target of prediction. The aim of this paper is to provide a survey of the state of the art in this field, which we refer to as preference-based multi-armed bandits. To this end, we provide an overview of the problems that have been considered in the literature as well as methods for tackling them. Our systematization is mainly based on the assumptions made by these methods about the data-generating process and, related to this, the properties of the preference-based feedback.
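
To make the feedback model concrete, the following is a minimal Python sketch of a dueling-bandit interaction loop in the spirit of UCB-style methods for preference-based bandits (such as RUCB), not an implementation of any specific algorithm from the survey. The preference matrix P, the horizon T, and the exploration constant alpha are assumptions made up for this demo; the deterministic champion/challenger selection is a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pairwise preference matrix (an assumption for this demo):
# P[i, j] is the probability that arm i wins a duel against arm j.
# Arm 0 is the Condorcet winner: it beats every other arm with
# probability greater than 1/2.
P = np.array([
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
])
K = P.shape[0]

wins = np.zeros((K, K))   # wins[i, j]: number of duels in which i beat j
plays = np.zeros((K, K))  # plays[i, j]: number of duels between i and j

T = 10_000
alpha = 0.51              # exploration constant for the optimism bonus

for t in range(1, T + 1):
    # Optimistic (UCB-style) estimates of the pairwise win probabilities.
    n = np.maximum(plays, 1)
    ucb = wins / n + np.sqrt(alpha * np.log(t + 1) / n)
    np.fill_diagonal(ucb, 0.5)

    # Champion: the arm whose worst optimistic matchup is best.
    i = int(np.argmax(ucb.min(axis=1)))

    # Challenger: the distinct arm most likely, optimistically, to beat i.
    challengers = ucb[:, i].copy()
    challengers[i] = -np.inf
    j = int(np.argmax(challengers))

    # The environment reveals only the qualitative outcome of the duel;
    # no numeric reward is ever observed.
    if rng.random() < P[i, j]:
        wins[i, j] += 1
    else:
        wins[j, i] += 1
    plays[i, j] += 1
    plays[j, i] += 1

print("duels per pair:\n", plays)
print("empirical win rates:\n", wins / np.maximum(plays, 1))
```

Note how the loop matches the abstract's setting: the learner chooses a pair of alternatives at each step and receives only a binary comparison outcome, so exploration and exploitation must both be driven by the empirical pairwise win statistics rather than by real-valued rewards.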
