Venue: International Joint Conference on Natural Language Processing; Conference on Empirical Methods in Natural Language Processing

Universal Adversarial Triggers for Attacking and Analyzing NLP

WARNING: This paper contains model outputs which are offensive in nature.

Abstract

Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling) that successfully trigger the target prediction. For example, triggers cause SNLI entailment accuracy to drop from 89.94% to 0.55%, 72% of "why" questions in SQuAD to be answered "to kill american people", and the GPT-2 language model to spew racist output even when conditioned on non-racial contexts. Furthermore, although the triggers are optimized using white-box access to a specific model, they transfer to other models for all tasks we consider. Finally, since triggers are input-agnostic, they provide an analysis of global model behavior. For instance, they confirm that SNLI models exploit dataset biases and help to diagnose heuristics learned by reading comprehension models.
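The abstract describes the trigger search only as "gradient-guided." One common way to realize such a search is a HotFlip-style first-order approximation. This is a minimal sketch under that assumption, not the authors' code. To update one trigger slot, each vocabulary token is scored by how much swapping it in would change the loss, approximated linearly as (e' - e) · ∇L. Here the gradient is averaged over a batch of inputs so the chosen token helps across the dataset rather than on a single example. The toy embedding matrix, the gradients, and the function names `average_gradient` and `rank_replacements` are all hypothetical. No real model or library API is involved.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def average_gradient(grads: List[Vector]) -> List[float]:
    """Average per-example gradients w.r.t. one trigger slot's embedding.

    Averaging over many inputs is what pushes the search toward an
    input-agnostic (universal) token rather than a per-example one.
    """
    n = len(grads)
    return [sum(g[j] for g in grads) / n for j in range(len(grads[0]))]


def rank_replacements(
    embeddings: List[Vector],  # hypothetical vocabulary embeddings, one row per token id
    current: Vector,           # embedding of the token currently in this trigger slot
    grad: Vector,              # averaged gradient of the target-prediction loss w.r.t. `current`
    k: int = 3,
) -> List[int]:
    """Return the k token ids whose swap most decreases the linearized loss.

    First-order Taylor approximation: L(e') ~= L(e) + (e' - e) . grad,
    so the best candidates minimize (e' - e) . grad. The current token
    always scores 0, so it stays in the ranking as a "no change" option.
    """
    scores = [dot([w - c for w, c in zip(row, current)], grad) for row in embeddings]
    return sorted(range(len(embeddings)), key=lambda i: scores[i])[:k]


if __name__ == "__main__":
    # Toy 2-d embeddings for a 4-token vocabulary (illustrative values only).
    vocab = [
        [0.0, 0.0],    # token 0 (currently in the trigger slot)
        [1.0, 0.0],    # token 1
        [0.0, 1.0],    # token 2
        [-1.0, -1.0],  # token 3
    ]
    # Gradients from two different inputs, averaged into one direction.
    g = average_gradient([[0.5, 1.0], [1.5, 1.0]])  # -> [1.0, 1.0]
    print(rank_replacements(vocab, vocab[0], g, k=2))  # -> [3, 0]
```

In a full attack, the top-ranked candidates would typically be re-scored with an actual forward pass before committing to a swap, and the step would be repeated over every trigger position. Both of those outer loops are omitted here to keep the scoring step itself visible.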