Method for improving policy, method for improving policy, and device improving apparatus

Abstract

PROBLEM TO BE SOLVED: To generate a feedback coefficient matrix that provides a policy for optimizing an accumulated cost or an accumulated reward.

SOLUTION: A change in the state of a control target 110 is defined by a linear difference equation, and an immediate cost or an immediate reward for the control target 110 is defined by a quadratic form of the state of the control target 110 and the input to the control target 110. A policy improvement device 100 perturbs each component of the feedback coefficient matrix that gives the policy and generates a TD error with respect to an estimated state value function, that is, an estimate of the state value function. Based on the TD error and the perturbation, the policy improvement device 100 generates an estimated gradient function matrix, which estimates the gradient function matrix of the state value function with respect to the feedback coefficient matrix for the state. The policy improvement device 100 then updates the feedback coefficient matrix using the generated estimated gradient function matrix.

SELECTED DRAWING: Figure 1

COPYRIGHT: (C)2019, JPO&INPIT
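The steps in the abstract can be illustrated with a minimal sketch. It is not the patented procedure itself but a simplified reading of it, under several assumptions that do not come from the source: the input is u = F x, the objective is a discounted accumulated cost, the perturbation is applied for a single step, and the estimated state value function V(x) = xᵀ P x is replaced by a fixed-point computation that uses the system matrices. The matrices `A`, `B`, `Q`, `R`, the discount `gamma`, the step size `alpha`, and all function names are hypothetical choices for illustration.

```python
import numpy as np

def value_matrix(A, B, F, Q, R, gamma, iters=500):
    """Return P with V(x) = x^T P x under u = F x.
    Stand-in for the estimated state value function (assumption:
    computed from A and B here instead of being learned from data)."""
    Acl = A + B @ F
    stage = Q + F.T @ R @ F
    P = np.zeros_like(Q)
    for _ in range(iters):
        P = stage + gamma * Acl.T @ P @ Acl
    return P

def estimate_gradient(A, B, F, Q, R, gamma, P, states, eps=1e-4):
    """Perturb each component of F by eps, form the TD error against
    V(x) = x^T P x, and use TD error / eps as the gradient estimate."""
    G = np.zeros_like(F)
    for x in states:
        v = x @ P @ x
        for i in range(F.shape[0]):
            for j in range(F.shape[1]):
                Fp = F.copy()
                Fp[i, j] += eps                  # perturb one component
                u = Fp @ x                       # input under perturbed policy
                cost = x @ Q @ x + u @ R @ u     # quadratic immediate cost
                x_next = A @ x + B @ u           # linear difference equation
                td = cost + gamma * (x_next @ P @ x_next) - v
                G[i, j] += td / eps
    return G / len(states)

# Hypothetical two-state, one-input control target.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
gamma, alpha = 0.95, 0.02

rng = np.random.default_rng(0)
states = rng.standard_normal((20, 2))   # sample states used for averaging

F = np.zeros((1, 2))                     # initial feedback coefficient matrix
P0 = value_matrix(A, B, F, Q, R, gamma)
for _ in range(200):
    P = value_matrix(A, B, F, Q, R, gamma)
    G = estimate_gradient(A, B, F, Q, R, gamma, P, states)
    F = F - alpha * G                    # descend, since a cost is minimized
P_final = value_matrix(A, B, F, Q, R, gamma)

print("trace(P) before:", np.trace(P0))
print("trace(P) after: ", np.trace(P_final))
print("feedback coefficient matrix:", F)
```

For an accumulated reward instead of a cost, the update sign would flip to `F + alpha * G`. In a model-free setting the call to `value_matrix` would be replaced by an estimator fitted to observed states and costs; the gradient step itself only needs the observed cost, the next state, and the perturbation size.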