International Joint Conference on Artificial Intelligence Workshops

Extending Sliding-Step Importance Weighting from Supervised Learning to Reinforcement Learning

Abstract

Stochastic gradient descent (SGD) has been at the center of many advances in modern machine learning. SGD processes examples sequentially, updating a weight vector in the direction that would most reduce the loss for that example. In many applications, some examples are more important than others and, to capture this, each example is given a non-negative weight that modulates its impact. Unfortunately, if the importance weights are highly variable they can greatly exacerbate the difficulty of setting the step-size parameter of SGD. To ease this difficulty, Karampatziakis and Langford developed a class of elegant algorithms that are much more robust in the face of highly variable importance weights in supervised learning. In this paper we extend their idea, which we call "sliding step", to reinforcement learning, where importance weighting can be particularly variable due to the importance sampling involved in off-policy learning algorithms. We compare two alternative ways of doing the extension in the linear function approximation setting, then introduce specific sliding-step versions of the TD(0) and Emphatic TD(0) learning algorithms. We prove the convergence of our algorithms and demonstrate their effectiveness on both on-policy and off-policy problems. Overall, our new algorithms appear to be effective in bringing the robustness of the sliding-step technique from supervised learning to reinforcement learning.
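To make the step-size difficulty concrete, below is a minimal sketch (not taken from the paper) of the supervised-learning case with a linear model and squared loss. The naive importance-weighted SGD step simply multiplies the gradient by the weight h, so a large h can overshoot the label badly for any fixed step size, whereas the closed-form importance-weight-aware ("sliding-step") update of Karampatziakis and Langford moves the prediction toward the label without ever passing it, however large h is. The sliding-step TD(0) and Emphatic TD(0) updates introduced in this paper are not reproduced here; the function names and toy numbers are illustrative only.

```python
import numpy as np

def naive_importance_weighted_step(w, x, y, h, eta):
    """One naive SGD step for squared loss 0.5*(w.x - y)^2 with importance
    weight h: the gradient is multiplied by h, so a large h can push the
    prediction far past the label for any fixed step size eta."""
    error = w @ x - y
    return w - eta * h * error * x

def sliding_step_update(w, x, y, h, eta):
    """Importance-weight-aware update for squared loss (closed form due to
    Karampatziakis and Langford): the limit of infinitely many infinitesimal
    steps whose importance sums to h. The prediction slides toward y but
    never overshoots it, no matter how large h is."""
    xx = x @ x
    if xx == 0.0:
        return w
    error = w @ x - y
    scale = (1.0 - np.exp(-eta * h * xx)) / xx
    return w - scale * error * x

if __name__ == "__main__":
    w = np.zeros(3)
    x = np.array([1.0, 2.0, -1.0])
    y, h, eta = 1.0, 1e6, 0.1

    # With an importance weight of a million, the naive step overshoots the
    # label by orders of magnitude, while the sliding step lands on it.
    print(naive_importance_weighted_step(w, x, y, h, eta) @ x)  # ~6e5
    print(sliding_step_update(w, x, y, h, eta) @ x)             # ~1.0
```

As h grows, the factor (1 - exp(-eta * h * xᵀx)) saturates at 1, so the sliding step is bounded regardless of how variable the importance weights are; as h shrinks, it reduces to the ordinary SGD step.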
