Multi Pseudo Q-Learning-Based Deterministic Policy Gradient for Tracking Control of Autonomous Underwater Vehicles

Abstract

This paper investigates the trajectory tracking problem for a class of underactuated autonomous underwater vehicles (AUVs) with unknown dynamics and constrained inputs. Unlike existing policy gradient methods, which employ a single actor and critic and cannot achieve satisfactory tracking accuracy or stable learning, the proposed algorithm attains high tracking control accuracy and stable learning by applying a hybrid actors-critics architecture, in which multiple actors and multiple critics are trained to learn a deterministic policy and an action-value function, respectively. Specifically, for the critics, an updating rule based on the expected absolute Bellman error is used to choose the worst critic to be updated at each time step. Then, to compute a loss function with a more accurate target value for the chosen critic, Pseudo Q-learning, which replaces the greedy policy in Q-learning with a subgreedy policy, is developed for continuous action spaces, and Multi Pseudo Q-learning (MPQ) is proposed to reduce the overestimation of the action-value function and to stabilize learning. For the actors, the deterministic policy gradient is applied to update the weights, and the final learned policy is defined as the average of all actors to avoid large but poor updates. Moreover, a qualitative stability analysis of the learning is given. The effectiveness and generality of the proposed MPQ-based deterministic policy gradient (MPQ-DPG) algorithm are verified by application to an AUV with two different reference trajectories. The results demonstrate the high tracking control accuracy and stable learning of MPQ-DPG, and further validate that increasing the number of actors and critics improves performance.
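The abstract only sketches the algorithm, so the following is a minimal, illustrative PyTorch sketch of an MPQ-DPG-style update, not the authors' implementation. The network sizes, the noise-perturbed candidate search standing in for the subgreedy policy, the mean-over-critics target, and all names (mlp, q_value, subgreedy_action, mpq_dpg_update, act) are assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2     # placeholder sizes; the paper's AUV model differs
NUM_ACTORS = NUM_CRITICS = 3
GAMMA = 0.99

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actors  = [mlp(STATE_DIM, ACTION_DIM) for _ in range(NUM_ACTORS)]
critics = [mlp(STATE_DIM + ACTION_DIM, 1) for _ in range(NUM_CRITICS)]
actor_opts  = [torch.optim.Adam(a.parameters(), lr=1e-4) for a in actors]
critic_opts = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]

def q_value(critic, s, a):
    return critic(torch.cat([s, a], dim=-1))

def subgreedy_action(s_next, noise_std=0.1):
    # Stand-in for the paper's subgreedy policy: instead of an exact greedy
    # max over a continuous action space, score each actor's proposal (plus
    # a noise-perturbed copy) with the averaged critics and keep the best.
    with torch.no_grad():
        cands = []
        for actor in actors:
            base = actor(s_next)
            cands.extend([base, base + noise_std * torch.randn_like(base)])
        cands = torch.stack(cands)                               # (C, B, act)
        s_rep = s_next.unsqueeze(0).expand(cands.shape[0], -1, -1)
        q = torch.stack([q_value(c, s_rep, cands)
                         for c in critics]).mean(0).squeeze(-1)  # (C, B)
        best = q.argmax(dim=0)
        return cands[best, torch.arange(s_next.shape[0])]

def mpq_dpg_update(s, a, r, s_next):
    # Critic step: compute each critic's expected absolute Bellman error on
    # the batch and update only the worst critic, as the abstract describes.
    a_next = subgreedy_action(s_next)
    with torch.no_grad():
        target = r + GAMMA * torch.stack(
            [q_value(c, s_next, a_next) for c in critics]).mean(0)
        errors = [torch.mean(torch.abs(target - q_value(c, s, a))).item()
                  for c in critics]
    worst = max(range(NUM_CRITICS), key=lambda i: errors[i])
    loss = nn.functional.mse_loss(q_value(critics[worst], s, a), target)
    critic_opts[worst].zero_grad(); loss.backward(); critic_opts[worst].step()

    # Actor step: deterministic policy gradient, i.e. ascend the averaged
    # critics' value of each actor's own action.
    for actor, opt in zip(actors, actor_opts):
        actor_loss = -torch.stack(
            [q_value(c, s, actor(s)) for c in critics]).mean()
        opt.zero_grad(); actor_loss.backward(); opt.step()

def act(s):
    # Final learned policy: the average of all actors (per the abstract).
    with torch.no_grad():
        return torch.stack([a(s) for a in actors]).mean(0)

# Toy usage with random transitions, just to show the shapes involved.
B = 32
s, a = torch.randn(B, STATE_DIM), torch.randn(B, ACTION_DIM)
r, s_next = torch.randn(B, 1), torch.randn(B, STATE_DIM)
mpq_dpg_update(s, a, r, s_next)
print(act(s).shape)  # torch.Size([32, 2])
```

Updating only the critic with the largest expected absolute Bellman error concentrates learning where the value estimate is worst, and averaging the actors for the deployed policy smooths out any single actor's large but poor update, matching the motivation given in the abstract.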
