
Policy improvement program, policy improvement method, and policy improvement apparatus



Abstract

PROBLEM TO BE SOLVED: To generate a feedback coefficient matrix that gives a policy optimizing an accumulated cost or an accumulated reward.

SOLUTION: The change in the state of a control target 110 is defined by a linear difference equation, and the immediate cost or immediate reward of the control target 110 is defined by a quadratic form of the state and the input of the control target 110. A policy improvement device 100 perturbs each component of the feedback coefficient matrix that gives the policy and generates a TD error with respect to an estimated state value function, that is, an estimate of the state value function. Based on the TD error and the perturbation, the policy improvement device 100 generates an estimated gradient function matrix, which estimates the gradient function matrix of the state value function with respect to the feedback coefficient matrix for the state. The policy improvement device 100 then updates the feedback coefficient matrix using the generated estimated gradient function matrix.

SELECTED DRAWING: Figure 1

COPYRIGHT: (C)2019, JPO&INPIT
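The steps named in the abstract (perturb each component of the feedback coefficient matrix, form a TD error against a state value function estimate, turn that into a gradient estimate, update the matrix) can be illustrated with a small numerical sketch. This is not the procedure claimed in the patent, only a simplified reading of the abstract under several assumptions: the dynamics are x' = A x + B u with policy u = F x, the immediate cost is x'Qx + u'Ru, future costs are discounted by a factor gamma, and the value estimate V(x) = x'Px is computed from a known model rather than learned from data. All function names, the averaging of TD errors over a few sample states, and the plain gradient-descent update are hypothetical choices made for illustration.

```python
import numpy as np

def evaluate_policy(A, B, Q, R, F, gamma):
    """Return P such that V(x) = x' P x under u = F x.

    Assumption: the model (A, B) is known, so P is obtained by solving
    P = Q + F'RF + gamma * Acl' P Acl directly. This stands in for the
    estimated state value function mentioned in the abstract.
    """
    n = A.shape[0]
    Acl = A + B @ F                      # closed-loop dynamics
    S = Q + F.T @ R @ F                  # per-step cost matrix under u = F x
    K = np.kron(Acl.T, Acl.T)            # row-major vec(Acl' P Acl) = K vec(P)
    p = np.linalg.solve(np.eye(n * n) - gamma * K, S.reshape(-1))
    return p.reshape(n, n)

def td_error(A, B, Q, R, P, F, x, gamma):
    """One-step TD error at state x when the input is u = F x."""
    u = F @ x
    x_next = A @ x + B @ u
    cost = x @ Q @ x + u @ R @ u
    return cost + gamma * (x_next @ P @ x_next) - x @ P @ x

def estimate_gradient(A, B, Q, R, P, F, states, gamma, eps=1e-4):
    """Perturb each component of F by eps and use (mean TD error) / eps."""
    G = np.zeros_like(F)
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            F_pert = F.copy()
            F_pert[i, j] += eps          # perturb a single component
            deltas = [td_error(A, B, Q, R, P, F_pert, x, gamma) for x in states]
            G[i, j] = np.mean(deltas) / eps
    return G

def improve_policy(A, B, Q, R, F, gamma, states, step=0.1, iters=50):
    """Alternate value evaluation and a gradient step on F (cost is minimized)."""
    for _ in range(iters):
        P = evaluate_policy(A, B, Q, R, F, gamma)
        G = estimate_gradient(A, B, Q, R, P, F, states, gamma)
        F = F - step * G
    return F

if __name__ == "__main__":
    # Hypothetical two-state, one-input system chosen only for the demo.
    A = np.array([[0.9, 0.2], [0.0, 0.8]])
    B = np.array([[0.0], [1.0]])
    Q, R, gamma = np.eye(2), np.eye(1), 0.95
    states = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    F = improve_policy(A, B, Q, R, np.zeros((1, 2)), gamma, states)
    print("improved feedback coefficient matrix:", F)
```

With the exact P, the TD error under the unperturbed policy is zero, so the TD error observed after perturbing one component is, to first order in eps, proportional to how the cost-to-go changes with that component. The step size and the number of iterations are tuning choices for this toy system and would need to be reconsidered for any other.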


Bibliographic data

Publication number: JP6958808B2

Patent type:
Publication date: 2021-11-02

Original format: PDF
Applicant/patentee: 富士通株式会社; 学校法人沖縄科学技術大学院大学学園

Application number: JP20170177985
Inventors: 佐々木智丈; 内部英治; 銅谷賢治; 穴井宏和; 屋並仁史; 岩根秀直

Filing date: 2017-09-15
Classification: G06N20; G06N99
Country: JP
Date added to database: 2022-08-24 22:02:45
