Neurocomputing

A temporal difference method for multi-objective reinforcement learning



Abstract

This work describes MPQ-learning, an algorithm that approximates the set of all deterministic non-dominated policies in multi-objective Markov decision problems, where rewards are vectors and each component stands for an objective to maximize. MPQ-learning directly generalizes the ideas of Q-learning to the multi-objective case. It can be applied to non-convex Pareto frontiers and finds both supported and unsupported solutions. We present the results of applying MPQ-learning to some benchmark problems. The algorithm solves these problems successfully, showing the feasibility of the approach. We also compare MPQ-learning to a standard linearization procedure that computes only supported solutions, and show that in some cases MPQ-learning can be as effective as the scalarization method. (C) 2017 Elsevier B.V. All rights reserved.
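The abstract contrasts Pareto dominance with linear scalarization: a linearization baseline can only recover "supported" solutions (points on the convex hull of the Pareto front), while Pareto-optimal points inside a non-convex frontier remain invisible to it. The sketch below is a rough Python illustration of those two building blocks only, not the paper's MPQ-learning update itself; all function names here are hypothetical.

import numpy as np

def dominates(u, v):
    """True if vector return u Pareto-dominates v
    (u >= v componentwise, with at least one strict inequality)."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u >= v) and np.any(u > v))

def pareto_front(vectors):
    """Keep only the non-dominated vector-valued returns."""
    return [u for u in vectors
            if not any(dominates(v, u) for v in vectors if v is not u)]

def scalarize(vectors, weights):
    """Linear scalarization baseline: pick the return maximizing w . u.
    It can only select supported (convex-hull) solutions."""
    w = np.asarray(weights)
    return max(vectors, key=lambda u: float(w @ np.asarray(u)))

# Two-objective example with a non-convex frontier: (1, 1) lies below
# the segment between (3, 0) and (0, 3), so it is Pareto-optimal but
# unsupported; no choice of non-negative weights ever selects it.
points = [np.array([3.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 3.0])]
print([p.tolist() for p in pareto_front(points)])  # all three survive
print(scalarize(points, [0.5, 0.5]).tolist())      # picks an extreme point

This is why an algorithm that maintains sets of non-dominated vector estimates, as MPQ-learning does, can find unsupported solutions that any scalarization-based procedure misses.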


