IEEE International Conference on Automation Science and Engineering

Path planning with user route preference - A reward surface approximation approach using orthogonal Legendre polynomials



Abstract

As self-driving cars become more ubiquitous, users will look for natural ways of informing the car's AI about their personal choice of routes. This choice is not always dictated by straightforward logic such as shortest distance or shortest time, and can be influenced by hidden factors such as comfort and familiarity. This paper presents a path-learning algorithm for such applications, in which, from limited positive demonstrations, an autonomous agent learns the user's path preference and honors that choice in its route planning, while retaining the capability to adopt alternate routes if the original choice(s) become impractical. The learning problem is modeled as a Markov decision process. The states (way-points) and actions (moves from one way-point to another) are pre-defined according to the existing network of paths between the origin and destination, and the user's demonstration is assumed to be a sample of the preferred path. The underlying reward function that captures the essence of the demonstration is computed using an inverse reinforcement learning algorithm, and from it the entire path mirroring the expert's demonstration is extracted. To alleviate the problem of state-space explosion when dealing with a large state space, the reward function is approximated using a set of orthogonal Legendre polynomial basis functions with a fixed number of coefficients, regardless of the size of the state space. A six-fold reduction in total learning time is achieved compared to using simple basis functions, whose dimensionality equals the number of distinct states.
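The abstract itself contains no code; the following is a minimal illustrative sketch of the reward-surface idea it describes, assuming way-point coordinates normalized to [-1, 1] (the natural domain of Legendre polynomials). The helper names (fit_reward_surface, reward), the polynomial degree, and the toy reward values are assumptions for illustration, not taken from the paper.

```python
# Sketch: approximate a reward surface over 2-D way-point coordinates with a
# truncated series of orthogonal Legendre polynomials, so the number of
# learned parameters stays fixed regardless of how many states there are.

import numpy as np
from numpy.polynomial import legendre as L


def fit_reward_surface(states_xy, rewards, degree=4):
    """Least-squares fit of a 2-D Legendre series to per-state rewards.

    states_xy : (N, 2) way-point coordinates, normalized to [-1, 1]
    rewards   : (N,) reward estimates at those way-points (e.g. from IRL)
    Returns a (degree + 1, degree + 1) coefficient matrix; its size is
    fixed no matter how many distinct states N the network contains.
    """
    x, y = states_xy[:, 0], states_xy[:, 1]
    # Build one design-matrix column per basis term P_i(x) * P_j(y).
    cols = []
    for i in range(degree + 1):
        for j in range(degree + 1):
            c = np.zeros((degree + 1, degree + 1))
            c[i, j] = 1.0
            cols.append(L.legval2d(x, y, c))
    A = np.stack(cols, axis=1)                      # shape (N, (degree+1)^2)
    coef_flat, *_ = np.linalg.lstsq(A, rewards, rcond=None)
    return coef_flat.reshape(degree + 1, degree + 1)


def reward(coef, x, y):
    """Evaluate the approximated reward surface at normalized (x, y)."""
    return L.legval2d(x, y, coef)


# Toy usage: 400 way-points, but only (4 + 1)^2 = 25 coefficients to learn.
rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(400, 2))
true_r = np.exp(-2.0 * ((pts[:, 0] - 0.3) ** 2 + (pts[:, 1] + 0.2) ** 2))
coef = fit_reward_surface(pts, true_r, degree=4)
print(reward(coef, 0.3, -0.2))   # roughly recovers the peak reward of 1.0
```

The fixed coefficient count is what makes the reported speed-up plausible: with a simple per-state basis the number of weights grows with the number of way-points, whereas here it depends only on the chosen polynomial degree.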
