JMLR: Workshop and Conference Proceedings

Dueling Posterior Sampling for Preference-Based Reinforcement Learning

Abstract

In preference-based reinforcement learning (RL), an agent interacts with the environment while receiving preferences instead of absolute feedback. While there is increasing research activity in preference-based RL, the design of formal frameworks that admit tractable theoretical analysis remains an open challenge. Building upon ideas from preference-based bandit learning and posterior sampling in RL, we present DUELING POSTERIOR SAMPLING (DPS), which employs preference-based posterior sampling to learn both the system dynamics and the underlying utility function that governs the preference feedback. As preference feedback is provided on trajectories rather than individual state-action pairs, we develop a Bayesian approach for the credit assignment problem, translating preferences to a posterior distribution over state-action reward models. We prove an asymptotic Bayesian no-regret rate for DPS with a Bayesian linear regression credit assignment model. This is the first regret guarantee for preference-based RL to our knowledge. We also discuss possible avenues for extending the proof methodology to other credit assignment models. Finally, we evaluate the approach empirically, showing competitive performance against existing baselines.
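The abstract describes the credit-assignment step only at a high level; the paper specifies the exact model. As a rough illustration, the Python sketch below shows one way a Bayesian linear regression credit-assignment model over trajectory preferences could look: each trajectory is summarized by the sum of its state-action features, the preference label is treated as a noisy linear observation of the utility difference, and the resulting Gaussian posterior over per-step reward weights can be sampled for posterior sampling. The class name, prior/noise parameters, and the ±1 label encoding are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch (not the authors' code): Bayesian linear regression
# credit assignment for trajectory preferences, with a Gaussian posterior
# over state-action reward weights that can be sampled for posterior
# (Thompson) sampling. Prior/noise values and names are assumptions.

class BayesianPreferenceCreditAssignment:
    def __init__(self, feature_dim, prior_var=1.0, noise_var=1.0):
        # Gaussian prior w ~ N(0, prior_var * I) over reward weights.
        self.noise_var = noise_var
        self.precision = np.eye(feature_dim) / prior_var  # posterior precision
        self.b = np.zeros(feature_dim)                     # precision-weighted mean

    def update(self, traj_a_features, traj_b_features, preference):
        # traj_*_features: (T, d) arrays of state-action features per trajectory.
        # preference: +1 if trajectory a is preferred, -1 if trajectory b is.
        # Treat the label as a noisy observation of w^T (phi_a - phi_b).
        x = traj_a_features.sum(axis=0) - traj_b_features.sum(axis=0)
        self.precision += np.outer(x, x) / self.noise_var
        self.b += preference * x / self.noise_var

    def sample_reward_weights(self):
        # Draw one reward model from the Gaussian posterior N(mean, cov).
        cov = np.linalg.inv(self.precision)
        mean = cov @ self.b
        return np.random.multivariate_normal(mean, cov)


if __name__ == "__main__":
    # Toy usage: two random trajectories of 10 steps with 4-dim features.
    model = BayesianPreferenceCreditAssignment(feature_dim=4)
    traj_a = np.random.rand(10, 4)
    traj_b = np.random.rand(10, 4)
    model.update(traj_a, traj_b, preference=+1)
    w = model.sample_reward_weights()  # per-step reward estimate: w @ features(s, a)
    print(w)
```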