Method and apparatus for improved reward-based learning using adaptive distance metrics

Abstract

The present invention is a method and an apparatus for reward-based learning of policies for managing or controlling a system or plant. In one embodiment, a method for reward-based learning includes receiving a set of one or more exemplars, where at least two of the exemplars comprise a (state, action) pair for a system, and at least one of the exemplars includes an immediate reward responsive to a (state, action) pair. A distance metric and a distance-based function approximator estimating long-range expected value are then initialized, where the distance metric computes a distance between two (state, action) pairs, and the distance metric and function approximator are adjusted such that a Bellman error measure of the function approximator on the set of exemplars is minimized. A management policy is then derived based on the trained distance metric and function approximator.
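
The training step described in the abstract can be illustrated with a small sketch. It is not the patented implementation; it only shows one way a distance metric and a distance-based value estimator might be adjusted together to reduce a Bellman error. Everything concrete here is an assumption made for illustration: the per-dimension scaling used as the metric, the Gaussian-kernel weighted average used as the function approximator, the finite-difference gradient descent with step halving, the toy one-dimensional system, and all names (`distance_sq`, `q_value`, `bellman_error`, `train`, `greedy_action`). Each exemplar is assumed to be a tuple of a (state, action) pair, an immediate reward, and the following (state, action) pair.

```python
import math

def distance_sq(x, y, scales):
    """Squared distance between two (state, action) vectors under learned per-dimension scales."""
    return sum((s * (a - b)) ** 2 for s, a, b in zip(scales, x, y))

def q_value(x, centers, values, scales):
    """Distance-based approximator: kernel-weighted average of the values stored at the exemplars."""
    weights = [math.exp(-distance_sq(x, c, scales)) for c in centers]
    total = sum(weights) + 1e-12  # guards against division by zero on underflow
    return sum(w * v for w, v in zip(weights, values)) / total

def bellman_error(params, exemplars, gamma):
    """Mean squared Bellman residual over (x, reward, x_next) exemplars."""
    dim = len(exemplars[0][0])
    scales, values = params[:dim], params[dim:]
    centers = [x for x, _, _ in exemplars]
    total = 0.0
    for x, reward, x_next in exemplars:
        target = reward + gamma * q_value(x_next, centers, values, scales)
        total += (q_value(x, centers, values, scales) - target) ** 2
    return total / len(exemplars)

def train(exemplars, gamma=0.9, lr=0.5, steps=100, eps=1e-6):
    """Adjust metric scales and stored values together by finite-difference gradient descent."""
    dim = len(exemplars[0][0])
    params = [1.0] * dim + [0.0] * len(exemplars)  # initial metric and initial value estimates
    error = bellman_error(params, exemplars, gamma)
    history = [error]
    for _ in range(steps):
        grad = []
        for k in range(len(params)):
            bumped = list(params)
            bumped[k] += eps
            grad.append((bellman_error(bumped, exemplars, gamma) - error) / eps)
        candidate = [p - lr * g for p, g in zip(params, grad)]
        new_error = bellman_error(candidate, exemplars, gamma)
        if new_error < error:
            params, error = candidate, new_error  # accept only improving steps
        else:
            lr *= 0.5  # otherwise shrink the step size and retry
        history.append(error)
    return params[:dim], params[dim:], history

def greedy_action(state, actions, centers, values, scales):
    """Derive a policy: choose the action with the highest estimated long-range value."""
    return max(actions, key=lambda a: q_value((state, a), centers, values, scales))

# Hypothetical toy system: state in [0, 1], action moves it by +/-0.25,
# and the immediate reward equals the resulting state.
exemplars = []
for s in (0.0, 0.25, 0.5, 0.75):
    for a in (-1.0, 1.0):
        s_next = min(1.0, max(0.0, s + 0.25 * a))
        exemplars.append(((s, a), s_next, (s_next, 1.0)))

centers = [x for x, _, _ in exemplars]
scales, values, history = train(exemplars)
print("Bellman error: %.4f -> %.4f" % (history[0], history[-1]))
print("learned metric scales:", scales)
print("greedy action at state 0.5:", greedy_action(0.5, (-1.0, 1.0), centers, values, scales))
```

The scales play the role of the adaptive distance metric: a dimension whose scale grows matters more when comparing two (state, action) pairs, and one whose scale shrinks is effectively ignored. A practical system would likely use analytic gradients and a richer metric, but the structure (initialize, minimize Bellman error over the exemplars, then act greedily) follows the abstract.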
