More on training strategies for critic and action neural networks in dual heuristic programming method

Abstract

The article describes a modification to the usual procedures for training the critic and action neural networks in the dual heuristic programming (DHP) method (D. Prokhorov and D. Wunsch, 1996; R. Santiago, 1995; P. Werbos, 1994). The modification entails updating both the critic and the action networks at each computational cycle, rather than only one at a time. The distinguishing feature is the introduction of a (real) second copy of the critic network whose weights are adjusted less often; the "desired value" for training the other critic is obtained from this critic copy. Previously (G. Lendaris and C. Paintz, 1997), the proposed modified training strategy was demonstrated on the pole-cart controller problem: the full 6-dimensional state vector was input to the critic and action NNs, but the utility function involved only the pole angle, not the distance along the track (x). For the first set of results presented here, the 3 states associated with the x variable were eliminated from the inputs to the NNs, keeping the same utility function previously defined. This resulted in improved learning and controller performance. From this point, the method is applied to two additional problems of increasing complexity: in the first, an x-related term is added to the utility function for the pole-cart problem, and the x-related states are simultaneously added back into the NN inputs (i.e., increasing the number of state variables used from 3 to 6); the second involves steering a vehicle with an independent drive motor on each wheel. The problem contexts and experimental results are provided.
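To make the described update structure concrete, the following is a minimal sketch of one training cycle, written here in PyTorch. It is an illustration under stated assumptions, not the paper's implementation: the plant model, utility function, network sizes, learning rates, and the copy interval K are placeholders. The two points it tracks from the abstract are that the critic and action networks are both updated at every computational cycle, and that the critic's training target (the "desired value") is computed from a separate copy of the critic whose weights are adjusted less often.

```python
# Minimal sketch of the modified DHP training cycle, assuming PyTorch.
# The plant model, utility function, network sizes, learning rates, and
# the copy interval K are illustrative placeholders, not the paper's design.
import copy
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA, K = 3, 1, 0.95, 10

# Critic outputs lambda(t) = dJ/ds (one value per state component);
# the action network outputs the control signal.
critic = nn.Sequential(nn.Linear(STATE_DIM, 16), nn.Tanh(), nn.Linear(16, STATE_DIM))
actor = nn.Sequential(nn.Linear(STATE_DIM, 16), nn.Tanh(), nn.Linear(16, ACTION_DIM))

# The "second copy" of the critic: its weights are refreshed only every
# K cycles, and it supplies the desired value for training the main critic.
critic_copy = copy.deepcopy(critic)
for p in critic_copy.parameters():
    p.requires_grad_(False)

opt_c = torch.optim.SGD(critic.parameters(), lr=1e-2)
opt_a = torch.optim.SGD(actor.parameters(), lr=1e-2)

# Stand-in differentiable plant model s(t+1) = f(s(t), a(t)); a real DHP
# setup would use a model of the pole-cart dynamics here.
A = 0.99 * torch.eye(STATE_DIM)
B = 0.10 * torch.ones(STATE_DIM, ACTION_DIM)

def plant(s, a):
    return s @ A.T + a @ B.T

def utility(s, a):
    # Assumed quadratic penalty on the first state component (e.g. pole angle).
    return (s[:, 0] ** 2).sum() + 0.01 * (a ** 2).sum()

s = torch.randn(1, STATE_DIM)
for t in range(1000):
    s = s.detach().requires_grad_(True)
    a = actor(s)
    s_next = plant(s, a)
    lam_next = critic_copy(s_next).detach()  # desired value comes from the copy

    # DHP critic target: total derivative d[U + gamma * J(s(t+1))]/ds(t),
    # obtained with autograd through the plant and the action network.
    surrogate = utility(s, a) + GAMMA * (lam_next * s_next).sum()
    target = torch.autograd.grad(surrogate, s, retain_graph=True)[0]

    # Critic update at this cycle ...
    critic_loss = ((critic(s.detach()) - target.detach()) ** 2).mean()
    opt_c.zero_grad()
    critic_loss.backward()
    opt_c.step()

    # ... and the action-network update at the very same cycle.
    actor_loss = utility(s.detach(), a) + GAMMA * (lam_next * plant(s.detach(), a)).sum()
    opt_a.zero_grad()
    actor_loss.backward()
    opt_a.step()

    # The critic copy is adjusted less often: refresh it every K cycles.
    if (t + 1) % K == 0:
        critic_copy.load_state_dict(critic.state_dict())

    s = s_next
```

In this sketch the copy is refreshed by a hard weight copy every K cycles; how frequently the copy's weights are adjusted is the tunable choice that the modified strategy introduces.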