JMLR: Workshop and Conference Proceedings

Dueling Posterior Sampling for Preference-Based Reinforcement Learning

Abstract

In preference-based reinforcement learning (RL), an agent interacts with the environment while receiving preferences instead of absolute feedback. While there is increasing research activity in preference-based RL, the design of formal frameworks that admit tractable theoretical analysis remains an open challenge. Building upon ideas from preference-based bandit learning and posterior sampling in RL, we present DUELING POSTERIOR SAMPLING (DPS), which employs preference-based posterior sampling to learn both the system dynamics and the underlying utility function that governs the preference feedback. As preference feedback is provided on trajectories rather than individual state-action pairs, we develop a Bayesian approach for the credit assignment problem, translating preferences to a posterior distribution over state-action reward models. We prove an asymptotic Bayesian no-regret rate for DPS with a Bayesian linear regression credit assignment model. This is the first regret guarantee for preference-based RL to our knowledge. We also discuss possible avenues for extending the proof methodology to other credit assignment models. Finally, we evaluate the approach empirically, showing competitive performance against existing baselines.
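The abstract describes the credit-assignment step only at a high level; the paper specifies the exact model. As a rough illustration, the Python sketch below shows one way a Bayesian linear regression credit-assignment model over trajectory preferences could look: each trajectory is summarized by the sum of its state-action features, the preference label is treated as a noisy linear observation of the utility difference, and the resulting Gaussian posterior over per-step reward weights can be sampled for posterior sampling. The class name, prior/noise parameters, and the ±1 label encoding are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch (not the authors' code): Bayesian linear regression
# credit assignment for trajectory preferences, with a Gaussian posterior
# over state-action reward weights that can be sampled for posterior
# (Thompson) sampling. Prior/noise values and names are assumptions.

class BayesianPreferenceCreditAssignment:
    def __init__(self, feature_dim, prior_var=1.0, noise_var=1.0):
        # Gaussian prior w ~ N(0, prior_var * I) over reward weights.
        self.noise_var = noise_var
        self.precision = np.eye(feature_dim) / prior_var  # posterior precision
        self.b = np.zeros(feature_dim)                     # precision-weighted mean

    def update(self, traj_a_features, traj_b_features, preference):
        # traj_*_features: (T, d) arrays of state-action features per trajectory.
        # preference: +1 if trajectory a is preferred, -1 if trajectory b is.
        # Treat the label as a noisy observation of w^T (phi_a - phi_b).
        x = traj_a_features.sum(axis=0) - traj_b_features.sum(axis=0)
        self.precision += np.outer(x, x) / self.noise_var
        self.b += preference * x / self.noise_var

    def sample_reward_weights(self):
        # Draw one reward model from the Gaussian posterior N(mean, cov).
        cov = np.linalg.inv(self.precision)
        mean = cov @ self.b
        return np.random.multivariate_normal(mean, cov)


if __name__ == "__main__":
    # Toy usage: two random trajectories of 10 steps with 4-dim features.
    model = BayesianPreferenceCreditAssignment(feature_dim=4)
    traj_a = np.random.rand(10, 4)
    traj_b = np.random.rand(10, 4)
    model.update(traj_a, traj_b, preference=+1)
    w = model.sample_reward_weights()  # per-step reward estimate: w @ features(s, a)
    print(w)
```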