EXPERIENCE REINFORCEMENT TYPE REINFORCEMENT LEARNING SYSTEM, EXPERIENCE REINFORCEMENT TYPE REINFORCEMENT LEARNING METHOD AND EXPERIENCE REINFORCEMENT TYPE REINFORCEMENT LEARNING PROGRAM

Abstract

PROBLEM TO BE SOLVED: To provide an experience reinforcement type reinforcement learning system and the like that can keep learning to avoid punishment from strongly influencing the learning result for obtaining reward.

SOLUTION: The experience reinforcement type reinforcement learning system includes: a state recognition means 1 for recognizing the state of an agent A; a rule selection means 2 for selecting one of the selectable state/action rules on the basis of an evaluation value; a reward evaluation value reinforcement means 3 that, when a reward is obtained, defines the series of all state/action rules selected by then as an episode and reinforces the reward evaluation values of all state/action rules in the episode together, using a weight for the reward; a punishment evaluation value reinforcement means 4 that, when punishment is received, defines the series of all state/action rules selected by then as an episode and reinforces the punishment evaluation values of all state/action rules in the episode together, using a weight for the punishment; and an evaluation value operation means 5 for obtaining an evaluation value Q from the function Q = Q(q[+], q[-]), where q[+] is the reward evaluation value and q[-] is the punishment evaluation value.

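The five means listed in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the class and method names are hypothetical, the form Q = q[+] - q[-] is only one assumed choice for the unspecified function Q(q[+], q[-]), and adding the same fixed weight to every rule in the episode is likewise an assumption about how the rules are reinforced "together".

```python
import random
from collections import defaultdict


class ExperienceReinforcementAgent:
    """Hypothetical sketch: separate reward and punishment values per rule."""

    def __init__(self, actions, reward_weight=1.0, punishment_weight=1.0, seed=None):
        self.actions = list(actions)
        self.reward_weight = reward_weight          # assumed weight for the reward
        self.punishment_weight = punishment_weight  # assumed weight for the punishment
        self.q_plus = defaultdict(float)    # q[+]: reward evaluation value per rule
        self.q_minus = defaultdict(float)   # q[-]: punishment evaluation value per rule
        self.episode = []                   # (state, action) rules selected so far
        self.rng = random.Random(seed)

    def evaluation(self, state, action):
        # Evaluation value operation (means 5).
        # Assumption: Q = q[+] - q[-]; the abstract does not give the function.
        rule = (state, action)
        return self.q_plus[rule] - self.q_minus[rule]

    def select_rule(self, state):
        # Rule selection (means 2): greedy on Q, ties broken at random.
        # The state is assumed to be already recognized (means 1).
        values = [self.evaluation(state, a) for a in self.actions]
        best = max(values)
        candidates = [a for a, v in zip(self.actions, values) if v == best]
        action = self.rng.choice(candidates)
        self.episode.append((state, action))
        return action

    def reinforce_reward(self):
        # Reward reinforcement (means 3): every occurrence of a rule in the
        # episode raises that rule's q[+] by the reward weight.
        for rule in self.episode:
            self.q_plus[rule] += self.reward_weight
        self.episode = []

    def reinforce_punishment(self):
        # Punishment reinforcement (means 4): every occurrence of a rule in the
        # episode raises that rule's q[-] by the punishment weight.
        for rule in self.episode:
            self.q_minus[rule] += self.punishment_weight
        self.episode = []
```

In use, the caller would invoke `select_rule(state)` at each step and then call `reinforce_reward()` or `reinforce_punishment()` when the environment delivers a reward or a punishment. Keeping q[+] and q[-] in separate tables means a punishment never overwrites what was learned from rewards; how strongly it affects selection depends entirely on the chosen Q function.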

Bibliographic data

Publication number: JP2011204036A
Publication date: 2011-10-13
Applicant: INSTITUTE OF NATIONAL COLLEGES OF TECHNOLOGY JAPAN
Application number: JP20100071118
Inventors: SAWA YOICHIRO; YAMAGUCHI MASASHI
Filing date: 2010-03-25
Classification: G06N3
Country: JP
Database entry time: 2022-08-21 18:25:52
