IEEE-RAS International Conference on Humanoid Robotics

Learning Deep Robot Controllers by Exploiting Successful and Failed Executions


Abstract

The prohibitive amount of data required to learn complex nonlinear policies, such as deep neural networks, has been significantly reduced by guided policy search (GPS) algorithms. However, while learning the control policy, the robot might fail and therefore generate unacceptable guiding samples. Failures may arise, for example, as a consequence of modeling or environmental uncertainties, and thus unsuccessful interactions should be explicitly considered while learning a complex policy. Currently, GPS methods update the robot policy by discarding unsuccessful trials or assigning them low probability. In other words, these methods overlook the existence of poorly performing executions, and therefore tend to underestimate the information carried by these interactions in subsequent iterations. In this paper we propose to learn deep neural network controllers with an extension of GPS that considers trajectories optimized with dualist constraints. These constraints are aimed at assisting the policy learning so that the trajectory distributions updated at each iteration are similar to good trajectory distributions (e.g., successful executions) while differing from bad trajectory distributions (e.g., failures). We show that neural network policies guided by trajectories optimized with our method reduce failures during the policy exploration phase, and therefore encourage safer interactions. This may have a relevant impact on tasks that involve physical contact with the environment or human partners.
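As a rough illustration of the dualist-constraint idea described in the abstract (a minimal sketch, not the authors' implementation), the snippet below scores a candidate Gaussian trajectory distribution by its expected cost plus a KL term that attracts it toward a "good" distribution (successes) and another that repels it from a "bad" one (failures). The diagonal-Gaussian assumption, the function names, and the alpha/beta trade-off weights are all hypothetical choices made here for illustration.

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between diagonal Gaussians given means and variances."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def dualist_objective(mu, var, mu_good, var_good, mu_bad, var_bad,
                      expected_cost, alpha=1.0, beta=0.1):
    """Surrogate objective for a trajectory-distribution update:
    prefer low expected cost, stay close to the 'good' distribution,
    and move away from the 'bad' one. alpha/beta are hypothetical weights."""
    attract = kl_diag_gaussians(mu, var, mu_good, var_good)  # pull toward successes
    repel = kl_diag_gaussians(mu, var, mu_bad, var_bad)      # push away from failures
    return expected_cost + alpha * attract - beta * repel

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 4  # e.g., a flattened state-action trajectory segment
    mu_good, var_good = np.zeros(dim), np.ones(dim)
    mu_bad, var_bad = 3.0 * np.ones(dim), np.ones(dim)
    mu, var = rng.normal(size=dim), np.ones(dim)
    print(dualist_objective(mu, var, mu_good, var_good, mu_bad, var_bad,
                            expected_cost=1.0))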
