Annual Meeting of the Association for Computational Linguistics

Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

Abstract

We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of a reward estimator, and the effect of the quality of reward estimates on the overall RL task. Our analysis of cardinal (5-point ratings) and ordinal (pairwise preferences) feedback shows that their intra- and inter-annotator α-agreement is comparable. Best reliability is obtained for standardized cardinal feedback, and cardinal feedback is also easiest to learn and generalize from. Finally, improvements of over 1 BLEU can be obtained by integrating a regression-based reward estimator trained on cardinal feedback for 800 translations into RL for NMT. This shows that RL is possible even from small amounts of fairly reliable human feedback, pointing to a great potential for applications at larger scale.
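
The α-agreement referred to above is Krippendorff's α. Below is a minimal sketch of how interval-scale (cardinal) agreement can be computed, assuming each unit is simply the list of numeric ratings it received; the function name and data layout are illustrative, not taken from the paper.

```python
from itertools import combinations

def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha for interval-scale ratings.

    `ratings` is a list of units (e.g. translations); each unit is the
    list of numeric ratings it received. Units rated fewer than two
    times cannot form a rater pair and are skipped.
    """
    units = [u for u in ratings if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        return float("nan")

    # Observed disagreement: squared differences between ratings of the
    # same unit, each unit weighted by 1 / (m_u - 1).
    d_o = sum(
        2.0 * sum((a - b) ** 2 for a, b in combinations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences over all pooled values.
    d_e = 2.0 * sum((a - b) ** 2 for a, b in combinations(values, 2)) / (n * (n - 1))

    return 1.0 - d_o / d_e

# Example: three translations, each rated on a 5-point scale by two annotators.
print(krippendorff_alpha_interval([[4, 5], [2, 2], [1, 3]]))
```

For ordinal (pairwise-preference) feedback the same α framework applies, with an ordinal distance in place of the squared difference.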
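The improvement reported above comes from a two-stage pipeline: a reward estimator is first fit by regression on the cardinally rated translations, and its estimates then stand in for the missing human reward in a policy-gradient update. The following is a toy sketch under stated assumptions: ridge regression over hypothetical fixed feature vectors stands in for the paper's reward estimator, and a categorical policy over k candidates stands in for the NMT system; all names and sizes other than the 800 rated translations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1: fit a reward estimator by ridge regression on cardinal feedback.
# Hypothetical setup: each logged translation is summarized by a feature
# vector Phi[i] and carries a standardized human rating r[i].
n_logged, dim = 800, 16            # 800 rated translations, as in the abstract
Phi = rng.normal(size=(n_logged, dim))
true_w = rng.normal(size=dim)
r = Phi @ true_w + 0.1 * rng.normal(size=n_logged)   # simulated ratings

lam = 1.0                          # ridge strength (illustrative)
w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(dim), Phi.T @ r)

def reward_estimate(phi):
    """Estimated reward for a translation with features `phi`."""
    return phi @ w_hat

# --- Stage 2: REINFORCE-style update driven by the estimated reward.
# Toy categorical "policy" over k candidate outputs, parameterized by logits.
k = 8
theta = np.zeros(k)
candidates = rng.normal(size=(k, dim))   # feature vectors of the k candidates
lr = 0.1

for _ in range(200):
    p = np.exp(theta - theta.max())
    p /= p.sum()
    y = rng.choice(k, p=p)               # bandit setting: sample one output
    r_hat = reward_estimate(candidates[y])
    grad_logp = -p                       # d log p(y|theta) / d theta ...
    grad_logp[y] += 1.0                  # ... = one-hot(y) - p
    theta += lr * r_hat * grad_logp      # ascend the estimated expected reward
```

The key design point mirrored here is that the policy update never sees a human score directly; it maximizes the regression model's estimate, which is why the quality of the reward estimator governs the downstream RL result.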
