ROBUST REINFORCEMENT LEARNING FOR CONSTRAINT SATISFACTION WHILE ACCOUNTING FOR MODEL MISSPECIFICATION

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for learning a control policy for controlling an agent. One of the methods includes sampling a mini-batch comprising one or more observation-action-reward tuples generated as a result of interactions of a first agent with a first environment; determining an update to current values of Q-network parameters by minimizing a robust constrained temporal difference (TD) error that accounts for possible perturbations of the states of the first environment represented by the observations in the observation-action-reward tuples; and determining, using the Q-value neural network, an update to the policy network parameters using the sampled mini-batch of observation-action-reward tuples.
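The abstract does not give the form of the robust constrained TD error, so the following Python sketch is only one plausible reading of the Q-update step, not the patented method. It assumes a linear Q-function with one weight vector per discrete action, a finite set of additive perturbations applied to the next observation, a pessimistic (minimum) bootstrap over those perturbations, and a Lagrangian penalty for the constraint. The `cost` and `next_obs` fields of each tuple, and every function and parameter name, are hypothetical. The policy-network update is omitted.

```python
def q_value(weights, obs, action):
    """Linear Q-function (assumed): dot product of the action's weights and obs."""
    return sum(w * x for w, x in zip(weights[action], obs))


def robust_constrained_td_error(weights, transition, perturbations,
                                lagrange_multiplier=1.0, discount=0.99):
    """TD error with a worst-case bootstrap over perturbed next observations."""
    obs, action, reward, cost, next_obs = transition
    # Constraint handling (assumed): subtract a Lagrangian penalty from the reward.
    penalised_reward = reward - lagrange_multiplier * cost
    # Robustness (assumed): compute the greedy next value under each candidate
    # perturbation and keep the most pessimistic one.
    worst_next_value = min(
        max(q_value(weights, [x + d for x, d in zip(next_obs, delta)], a)
            for a in range(len(weights)))
        for delta in perturbations
    )
    target = penalised_reward + discount * worst_next_value
    return target - q_value(weights, obs, action)


def q_update(weights, mini_batch, perturbations, learning_rate=0.1, **kwargs):
    """One semi-gradient step that reduces the mean squared TD error."""
    new_weights = [list(w) for w in weights]
    for transition in mini_batch:
        obs, action = transition[0], transition[1]
        td_error = robust_constrained_td_error(
            weights, transition, perturbations, **kwargs)
        for i, x in enumerate(obs):
            new_weights[action][i] += learning_rate * td_error * x / len(mini_batch)
    return new_weights
```

Calling `q_update(weights, mini_batch, perturbations)` with a list of `(obs, action, reward, cost, next_obs)` tuples returns updated weights and leaves the input unchanged. Taking the minimum over perturbations makes the target no larger than the unperturbed one whenever the zero perturbation is in the set, which is the sense in which this sketch is "robust".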

Bibliographic data

Publication number: WO2022069758A1

Patent type:
Publication date: 2022-04-07

Original format: PDF
Applicant/assignee: DEEPMIND TECHNOLOGIES LIMITED

Application number: WO2021EP77237
Inventors: MANKOWITZ DANIEL J.; CALIAN DAN-ANDREI; MANN TIMOTHY ARTHUR

Filing date: 2021-10-04
Classification: G06N3; G06N3/04; G06N3/08
Country: EP
Database entry time: 2022-08-25 00:24:44
