Engineering Applications of Artificial Intelligence

Learning adversarial attack policies through multi-objective reinforcement learning



Abstract

Deep Reinforcement Learning has shown promising results in learning policies for complex sequential decision-making tasks. However, different adversarial attack strategies have revealed the vulnerability of these policies to perturbations of their observations. Most of these attacks are built on existing adversarial-example crafting techniques used to fool classifiers, where an attack is considered successful if it makes the classifier output any wrong class. The major drawback of these approaches when applied to decision-making tasks is that they are blind to long-term goals. In contrast, this paper argues that it is more appropriate to view the attack process as a sequential optimization problem, with the aim of learning a sequence of attacks in which the attacker must consider the long-term effects of each attack. In this paper, we propose that such an attack policy must be learned with two objectives in view. On the one hand, the attack must pursue the maximum performance loss of the attacked policy. On the other hand, it should also minimize the cost of the attacks. Therefore, we propose a novel formulation of the process of learning an attack policy as a Multi-objective Markov Decision Process with two objectives: maximizing the performance loss of the attacked policy and minimizing the cost of the attacks. We also reveal the conflicting nature of these two objectives and use a Multi-objective Reinforcement Learning algorithm to draw the Pareto fronts for four well-known tasks: GridWorld, CartPole, Mountain Car and Breakout.
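The two-objective formulation described in the abstract can be illustrated with a minimal sketch, not taken from the paper: the attacker interacts with an environment wrapper in which its only decision at each step is whether to perturb the victim policy's observation, and it receives a vector reward pairing the victim's performance loss with the (negative) cost of attacking. All names (`victim_env`, `victim_policy`, `perturb`, `attack_cost`) are hypothetical placeholders.

```python
import numpy as np

class AttackMOMDP:
    """Sketch of an attack MDP with a two-dimensional (vector) reward."""

    def __init__(self, victim_env, victim_policy, perturb, attack_cost=1.0):
        self.env = victim_env          # task solved by the fixed, pre-trained victim policy
        self.victim = victim_policy    # maps an observation to the victim's action
        self.perturb = perturb         # adversarial perturbation applied to an observation
        self.attack_cost = attack_cost # per-step cost charged when the attacker perturbs
        self.obs = None

    def reset(self):
        self.obs = self.env.reset()
        return self.obs                # the attacker sees what the victim would see

    def step(self, attack):
        # Attacker action: 1 = perturb the victim's observation, 0 = leave it intact.
        obs_seen = self.perturb(self.obs) if attack else self.obs
        victim_action = self.victim(obs_seen)
        self.obs, victim_reward, done, info = self.env.step(victim_action)
        # Vector reward holding the two conflicting objectives:
        #   [0] performance loss of the attacked policy (the negated victim reward)
        #   [1] negated attack cost (rewarding the attacker for attacking sparingly)
        reward = np.array([-victim_reward, -self.attack_cost * attack])
        return self.obs, reward, done, info
```

A multi-objective RL algorithm trained on such a wrapper would return not a single policy but a set of non-dominated attack policies, each trading performance degradation against attack cost, which is what the Pareto fronts reported for GridWorld, CartPole, Mountain Car and Breakout summarize.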
