Extending Sliding-Step Importance Weighting from Supervised Learning to Reinforcement Learning

Abstract

Stochastic gradient descent (SGD) has been at the center of many advances in modern machine learning. SGD processes examples sequentially, updating a weight vector in the direction that would most reduce the loss for that example. In many applications, some examples are more important than others and, to capture this, each example is given a non-negative weight that modulates its impact. Unfortunately, if the importance weights are highly variable, they can greatly exacerbate the difficulty of setting the step-size parameter of SGD. To ease this difficulty, Karampatziakis and Langford developed a class of elegant algorithms that are much more robust in the face of highly variable importance weights in supervised learning. In this paper we extend their idea, which we call "sliding step", to reinforcement learning, where importance weighting can be particularly variable due to the importance sampling involved in off-policy learning algorithms. We compare two alternative ways of making the extension in the linear function approximation setting, then introduce specific sliding-step versions of the TD(0) and Emphatic TD(0) learning algorithms. We prove the convergence of our algorithms and demonstrate their effectiveness on both on-policy and off-policy problems. Overall, our new algorithms appear to be effective in bringing the robustness of the sliding-step technique from supervised learning to reinforcement learning.
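
The step-size difficulty described in the abstract can be illustrated with a minimal sketch in the supervised setting. It assumes a linear predictor with squared loss 0.5 * (w·x - y)^2 and compares a plain SGD step whose gradient is multiplied by the importance weight h against the closed-form limit of taking many infinitesimal steps over a total "time" h. The function names, the toy numbers, and the choice of squared loss are assumptions made for illustration; this is not the paper's sliding-step TD(0) or Emphatic TD(0) algorithm, which the abstract does not specify.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def naive_update(w, x, y, h, eta):
    """Plain SGD on squared loss, with the gradient multiplied by weight h."""
    err = dot(w, x) - y
    return [wi - eta * h * err * xi for wi, xi in zip(w, x)]

def limit_update(w, x, y, h, eta):
    """Limit of infinitely many tiny SGD steps over total 'time' h.

    For squared loss the error decays as exp(-eta * h * x.x), so the
    prediction moves toward y but can never cross it.
    """
    xx = dot(x, x)
    if xx == 0.0:
        return list(w)  # no information in a zero feature vector
    err = dot(w, x) - y
    scale = err / xx * (1.0 - math.exp(-eta * h * xx))
    return [wi - scale * xi for wi, xi in zip(w, x)]

# Toy example (hypothetical numbers): one example with a large weight.
w0 = [0.0, 0.0]
x, y = [1.0, 2.0], 1.0
eta, h = 0.5, 10.0

naive_pred = dot(naive_update(w0, x, y, h, eta), x)
limit_pred = dot(limit_update(w0, x, y, h, eta), x)
print("target:", y, "naive:", naive_pred, "limit:", limit_pred)
```

With these numbers the naive step overshoots the target by a wide margin, while the limiting update lands just short of it, which is the kind of robustness to large importance weights that the abstract refers to.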


Bibliographic record

Source
International Joint Conference on Artificial Intelligence Workshops | 2019 | pp. 67-82 | 16 pages
Authors
Tian Tian; Richard S. Sutton
Original format: PDF
Keywords
Reinforcement learning; Temporal difference learning; Off-policy


