Journal: Machine Learning

Preference-based reinforcement learning: a formal framework and a policy iteration algorithm



Abstract

This paper makes a first step toward the integration of two subfields of machine learning, namely preference learning and reinforcement learning (RL). An important motivation for a preference-based approach to reinforcement learning is the observation that in many real-world domains, numerical feedback signals are not readily available, or are defined arbitrarily in order to satisfy the needs of conventional RL algorithms. Instead, we propose an alternative framework for reinforcement learning, in which qualitative reward signals can be directly used by the learner. The framework may be viewed as a generalization of the conventional RL framework in which only a partial order between policies is required instead of the total order induced by their respective expected long-term reward. Therefore, building on novel methods for preference learning, our general goal is to equip the RL agent with qualitative policy models, such as ranking functions that allow for sorting its available actions from most to least promising, as well as algorithms for learning such models from qualitative feedback. As a proof of concept, we realize a first simple instantiation of this framework that defines preferences based on utilities observed for trajectories. To that end, we build on an existing method for approximate policy iteration based on rollouts. While this approach is based on the use of classification methods for generalization and policy learning, we make use of a specific type of preference learning method called label ranking. Advantages of preference-based approximate policy iteration are illustrated by means of two case studies.
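As a rough illustration of the approach described in the abstract, the following is a minimal, hypothetical Python sketch of one step of preference-based approximate policy iteration with rollouts. The toy `ChainWorld` environment, the Borda-count `rank_actions` stand-in for a proper label-ranking learner, and all function names are illustrative assumptions rather than the paper's actual algorithm or code.

```python
import random
from collections import defaultdict

class ChainWorld:
    """Toy deterministic chain: states 0..9, actions "left"/"right"; state 9 is the goal."""
    n_states = 10
    actions = ("left", "right")

    def step(self, s, a):
        return max(0, s - 1) if a == "left" else min(self.n_states - 1, s + 1)

    def utility(self, s, a, s_next):
        # Stand-in for qualitative trajectory feedback: only reaching the goal counts.
        return 1.0 if s_next == self.n_states - 1 else 0.0

def rollout_utility(env, s, first_action, policy, horizon=15):
    """Utility of one trajectory that starts with `first_action` and then follows `policy`."""
    total, a = 0.0, first_action
    for _ in range(horizon):
        s_next = env.step(s, a)
        total += env.utility(s, a, s_next)
        s, a = s_next, policy(s_next)
    return total

def collect_preferences(env, policy, n_rollouts=3):
    """Compare actions in every state by average rollout utility; keep pairwise preferences."""
    prefs = []
    for s in range(env.n_states):
        avg = {a: sum(rollout_utility(env, s, a, policy) for _ in range(n_rollouts)) / n_rollouts
               for a in env.actions}
        prefs += [(s, a, b) for a in env.actions for b in env.actions
                  if a != b and avg[a] > avg[b]]
    return prefs

def rank_actions(prefs, actions):
    """Borda-style stand-in for a label ranker: per state, pick the action that wins most often."""
    wins = defaultdict(lambda: defaultdict(int))
    for s, better, _worse in prefs:
        wins[s][better] += 1
    return {s: max(actions, key=lambda a: wins[s][a]) for s in wins}

# One improvement step: evaluate a random policy via rollouts, then act greedily on the ranking.
env = ChainWorld()
random_policy = lambda s: random.choice(env.actions)
greedy = rank_actions(collect_preferences(env, random_policy), env.actions)
improved_policy = lambda s: greedy.get(s, random.choice(env.actions))
print({s: greedy[s] for s in sorted(greedy)})  # top-ranked action per state
```

In the framework proposed by the paper, the Borda-style aggregation above would be replaced by a label-ranking model trained on the preference pairs, and the evaluate-rank-improve loop would be iterated as in approximate policy iteration.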
