Annual Meeting of the Association for Computational Linguistics

Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

Abstract

We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of a reward estimator, and the effect of the quality of reward estimates on the overall RL task. Our analysis of cardinal (5-point ratings) and ordinal (pairwise preferences) feedback shows that their intra- and inter-annotator α-agreement is comparable. Best reliability is obtained for standardized cardinal feedback, and cardinal feedback is also easiest to learn and generalize from. Finally, improvements of over 1 BLEU can be obtained by integrating a regression-based reward estimator trained on cardinal feedback for 800 translations into RL for NMT. This shows that RL is possible even from small amounts of fairly reliable human feedback, pointing to a great potential for applications at larger scale.
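
The α-agreement referred to above is Krippendorff's α. Below is a minimal sketch of how interval-scale (cardinal) agreement can be computed, assuming each unit is simply the list of numeric ratings it received; the function name and data layout are illustrative, not taken from the paper.

```python
from itertools import combinations

def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha for interval-scale ratings.

    `ratings` is a list of units (e.g. translations); each unit is the
    list of numeric ratings it received. Units rated fewer than two
    times cannot form a rater pair and are skipped.
    """
    units = [u for u in ratings if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        return float("nan")

    # Observed disagreement: squared differences between ratings of the
    # same unit, each unit weighted by 1 / (m_u - 1).
    d_o = sum(
        2.0 * sum((a - b) ** 2 for a, b in combinations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences over all pooled values.
    d_e = 2.0 * sum((a - b) ** 2 for a, b in combinations(values, 2)) / (n * (n - 1))

    return 1.0 - d_o / d_e

# Example: three translations, each rated on a 5-point scale by two annotators.
print(krippendorff_alpha_interval([[4, 5], [2, 2], [1, 3]]))
```

For ordinal (pairwise-preference) feedback the same α framework applies, with an ordinal distance in place of the squared difference.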
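The improvement reported above comes from a two-stage pipeline: a reward estimator is first fit by regression on the cardinally rated translations, and its estimates then stand in for the missing human reward in a policy-gradient update. The following is a toy sketch under stated assumptions: ridge regression over hypothetical fixed feature vectors stands in for the paper's reward estimator, and a categorical policy over k candidates stands in for the NMT system; all names and sizes other than the 800 rated translations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1: fit a reward estimator by ridge regression on cardinal feedback.
# Hypothetical setup: each logged translation is summarized by a feature
# vector Phi[i] and carries a standardized human rating r[i].
n_logged, dim = 800, 16            # 800 rated translations, as in the abstract
Phi = rng.normal(size=(n_logged, dim))
true_w = rng.normal(size=dim)
r = Phi @ true_w + 0.1 * rng.normal(size=n_logged)   # simulated ratings

lam = 1.0                          # ridge strength (illustrative)
w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(dim), Phi.T @ r)

def reward_estimate(phi):
    """Estimated reward for a translation with features `phi`."""
    return phi @ w_hat

# --- Stage 2: REINFORCE-style update driven by the estimated reward.
# Toy categorical "policy" over k candidate outputs, parameterized by logits.
k = 8
theta = np.zeros(k)
candidates = rng.normal(size=(k, dim))   # feature vectors of the k candidates
lr = 0.1

for _ in range(200):
    p = np.exp(theta - theta.max())
    p /= p.sum()
    y = rng.choice(k, p=p)               # bandit setting: sample one output
    r_hat = reward_estimate(candidates[y])
    grad_logp = -p                       # d log p(y|theta) / d theta ...
    grad_logp[y] += 1.0                  # ... = one-hot(y) - p
    theta += lr * r_hat * grad_logp      # ascend the estimated expected reward
```

The key design point mirrored here is that the policy update never sees a human score directly; it maximizes the regression model's estimate, which is why the quality of the reward estimator governs the downstream RL result.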
