SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence

CostNet: An End-to-End Framework for Goal-Directed Reinforcement Learning


Abstract

Reinforcement Learning (RL) is a general framework concerned with an agent that seeks to maximize rewards in an environment. Learning typically happens through trial and error using explorative methods such as ε-greedy. Two approaches, model-based and model-free reinforcement learning, have shown concrete results in several disciplines. Model-based RL learns a model of the environment and uses it to learn the policy, while model-free approaches are purely explorative and exploitative without considering the underlying environment dynamics. Model-free RL works conceptually well in simulated environments, and empirical evidence suggests that trial and error leads to near-optimal behavior with enough training. Model-based RL, on the other hand, aims to be sample efficient, and studies show that it requires far less training in the real environment to learn a good policy. A significant challenge with RL is that it relies on a well-defined reward function to work well in complex environments, and such a reward function is challenging to define. Goal-Directed RL is an alternative method that learns an intrinsic reward function, with emphasis on a few explored trajectories that reveal the path to the goal state. This paper introduces a novel reinforcement learning algorithm for predicting the distance between two states in a Markov Decision Process. The learned distance function works as an intrinsic reward that fuels the agent's learning. Using the distance metric as a reward, we show that the algorithm performs comparably to model-free RL while being significantly more sample-efficient in several test environments.
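
The abstract describes the core mechanism only at a high level: a learned function predicts the distance between two states, and that prediction serves as an intrinsic reward. The sketch below illustrates one plausible reading of that idea in PyTorch. The names (DistanceNet, distance_training_loss, intrinsic_reward), the architecture, and the step-count regression target are assumptions made for illustration; they are not taken from the paper itself.

```python
# Hypothetical sketch only: a small network that predicts the distance
# (expected number of steps) between two states, used to shape an intrinsic
# reward. Architecture and training target are assumptions, not the paper's
# actual CostNet specification.
import torch
import torch.nn as nn


class DistanceNet(nn.Module):
    """Predicts a non-negative distance estimate d(s, s') between two states."""

    def __init__(self, state_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Softplus(),  # keep the predicted distance non-negative
        )

    def forward(self, state, other_state):
        return self.net(torch.cat([state, other_state], dim=-1)).squeeze(-1)


def distance_training_loss(dist_net, states, i, j):
    """Regress the prediction toward the observed step gap |j - i| between
    two states drawn from the same trajectory (an assumed training signal)."""
    target = torch.tensor(float(abs(j - i)))
    pred = dist_net(states[i], states[j])
    return nn.functional.mse_loss(pred, target)


def intrinsic_reward(dist_net, state, next_state, goal_state):
    """One plausible shaping: reward the decrease in predicted distance to the goal."""
    with torch.no_grad():
        return (dist_net(state, goal_state) - dist_net(next_state, goal_state)).item()
```

Under these assumptions, an agent would add the intrinsic term to (or substitute it for) the environment reward while training with any standard model-free algorithm, so that progress toward the goal state is rewarded even when the external reward is sparse.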
