Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies

Combining a gradient-based method and an evolution strategy for multi-objective reinforcement learning



Abstract

Multi-objective reinforcement learning (MORL) algorithms aim to approximate the Pareto frontier uniformly in multi-objective decision-making problems. In deep reinforcement learning (RL), gradient-based methods are often adopted to learn deep policies/value functions because of their fast convergence, but purely gradient-based methods cannot guarantee a uniformly approximated Pareto frontier. On the other hand, evolution strategies operate directly in the solution space and can achieve a well-distributed Pareto frontier, yet applying evolution strategies to optimize deep networks remains challenging. To leverage the advantages of both kinds of methods, we propose a two-stage MORL framework that combines a gradient-based method with an evolution strategy. First, an efficient multi-policy soft actor-critic algorithm is proposed to learn multiple policies collaboratively; the lower layers of all policy networks are shared, so this first stage can be regarded as representation learning. Second, the multi-objective covariance matrix adaptation evolution strategy (MO-CMA-ES) is applied to fine-tune the policy-independent parameters, yielding a dense and uniform approximation of the Pareto frontier. Experimental results on three benchmarks (Deep Sea Treasure, Adaptive Streaming, and Super Mario Bros) show the superiority of the proposed method.
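To make the two-stage structure concrete, below is a minimal, hypothetical NumPy sketch (not the authors' code). The shared lower layer stands in for the representation that stage 1 would learn with the multi-policy soft actor-critic, and stage 2 perturbs only the per-policy head parameters, accepting a child unless its parent Pareto-dominates it — a deliberately simplified stand-in for MO-CMA-ES, which in reality adapts per-individual step sizes and covariance matrices. The toy reward table, network sizes, and hyperparameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, HIDDEN, N_ACTIONS, N_POLICIES = 4, 16, 3, 4

# Per-action reward vectors for a toy 2-objective problem: action 2 is
# dominated, while actions 0 and 1 trade the two objectives off.
REWARDS = np.array([[1.0, 0.2], [0.2, 1.0], [0.1, 0.1]])
EVAL_STATES = rng.normal(size=(64, STATE_DIM))  # fixed evaluation states


def init_shared():
    # Lower layer shared by all policies; in the paper this representation is
    # learned in stage 1 by the multi-policy soft actor-critic, then reused.
    return rng.normal(0.0, 0.5, (STATE_DIM, HIDDEN))


def init_head():
    # Policy-independent upper layer, one per policy: the parameters that the
    # evolution strategy fine-tunes in stage 2.
    return rng.normal(0.0, 0.5, (HIDDEN, N_ACTIONS))


def evaluate(shared, head):
    # Stand-in for an episode rollout: average 2-objective reward over states.
    hidden = np.tanh(EVAL_STATES @ shared)
    actions = np.argmax(hidden @ head, axis=1)
    return REWARDS[actions].mean(axis=0)


def dominates(a, b):
    return bool(np.all(a >= b) and np.any(a > b))


# Stage 1 (placeholder): gradient-based representation learning is skipped
# here; we only initialise the shared trunk and the per-policy heads.
shared = init_shared()
population = [init_head() for _ in range(N_POLICIES)]

# Stage 2 (simplified): mutate each head and keep the child unless the parent
# Pareto-dominates it -- a crude stand-in for MO-CMA-ES selection.
for generation in range(50):
    for i, head in enumerate(population):
        child = head + rng.normal(0.0, 0.1, head.shape)
        if not dominates(evaluate(shared, head), evaluate(shared, child)):
            population[i] = child

front = np.array([evaluate(shared, h) for h in population])
print("approximate Pareto front:\n", np.round(front, 3))
```

Because only the head parameters are mutated, every candidate reuses the same shared representation, which is the point of splitting the framework into a gradient-based stage and an evolutionary stage.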


