Universal Adversarial Triggers for Attacking and Analyzing NLP
WARNING: This paper contains model outputs which are offensive in nature

Abstract

Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling) that successfully trigger the target prediction. For example, triggers cause SNLI entailment accuracy to drop from 89.94% to 0.55%, 72% of "why" questions in SQuAD to be answered "to kill american people", and the GPT-2 language model to spew racist output even when conditioned on non-racial contexts. Furthermore, although the triggers are optimized using white-box access to a specific model, they transfer to other models for all tasks we consider. Finally, since triggers are input-agnostic, they provide an analysis of global model behavior. For instance, they confirm that SNLI models exploit dataset biases and help to diagnose heuristics learned by reading comprehension models.
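
The abstract says only that the token search is gradient-guided. The following is a minimal sketch of one way such a search could work, not the authors' implementation. It assumes a toy linear classifier over summed embeddings, a single-token trigger, and a first-order (linear) estimate of how the loss changes when the trigger token is swapped. All names (`loss_and_grad`, `search_trigger`, the vocabulary and its vectors) are hypothetical.

```python
import math

def loss_and_grad(trigger_vec, inputs, w, target_sign):
    """Mean logistic loss toward the target class over a batch, plus its
    gradient with respect to the trigger embedding.
    Toy model (an assumption): score = w . (trigger_vec + x)."""
    dim = len(w)
    total, grad = 0.0, [0.0] * dim
    for x in inputs:
        score = sum(w[i] * (trigger_vec[i] + x[i]) for i in range(dim))
        z = target_sign * score
        total += math.log1p(math.exp(-z))            # -log sigmoid(z)
        coeff = -target_sign / (1.0 + math.exp(z))   # d loss / d score
        for i in range(dim):
            grad[i] += coeff * w[i]                  # d score / d trigger = w
    n = len(inputs)
    return total / n, [g / n for g in grad]

def search_trigger(vocab, inputs, w, target_sign, start, steps=5):
    """Greedy search: swap in the token whose embedding gives the lowest
    first-order estimate of the batch loss, and keep it only if the true
    loss actually decreases."""
    current = start
    for _ in range(steps):
        loss, grad = loss_and_grad(vocab[current], inputs, w, target_sign)

        def estimate(tok):
            # Linear estimate of the loss change: (e_tok - e_current) . grad
            return sum((vocab[tok][i] - vocab[current][i]) * grad[i]
                       for i in range(len(grad)))

        best = min(vocab, key=estimate)
        best_loss, _ = loss_and_grad(vocab[best], inputs, w, target_sign)
        if best == current or best_loss >= loss:
            break
        current = best
    return current

# Hypothetical two-dimensional embeddings and inputs, for illustration only.
vocab = {"the": [0.0, 0.0], "alpha": [1.0, 0.5],
         "beta": [-1.0, -0.5], "gamma": [3.0, 1.0]}
inputs = [[-0.5, -0.2], [0.3, -0.4], [-1.0, 0.1]]
w = [1.0, 1.0]
trigger = search_trigger(vocab, inputs, w, target_sign=1, start="the")
```

Because the same trigger is scored against the whole batch rather than a single example, the selected token is input-agnostic in the sense the abstract describes. A real model is nonlinear, so the first-order estimate would only rank candidates approximately, which is why the sketch re-checks the true loss before accepting a swap.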

Bibliographic record

Source: International joint conference on natural language processing; Conference on empirical methods in natural language processing | 2019 | pp. 2153-2162 | 10 pages
Authors: Eric Wallace; Shi Feng; Nikhil Kandpal; Matt Gardner; Sameer Singh
Original format: PDF

Similar literature

1. Real-time, Robust and Adaptive Universal Adversarial Attacks Against Speaker Recognition Systems [J]. Xie Yi, Li Zhuohang, Shi Cong, et al. Journal of signal processing systems for signal, image, and video technology. 2021, No. 10
2. Universal adversarial attacks on deep neural networks for medical image classification [J]. Hokuto Hirano, Akinori Minagi, Kazuhiro Takemoto. BMC Medical Imaging. 2021, No. 1
3. Event-triggered secure observer-based control for cyber-physical systems under adversarial attacks [J]. An-Yang Lu, Guang-Hong Yang. Information Sciences: An International Journal. 2017
4. Universal Adversarial Triggers for Attacking and Analyzing NLP WARNING: This paper contains model outputs which are offensive in nature [C]. Eric Wallace, Shi Feng, Nikhil Kandpal, et al. International joint conference on natural language processing. 2019
5. Adversarial Machine Learning in Computer Vision: Attacks and Defenses on Machine Learning Models [D]. Qin, Yi. 2021
6. Universal adversarial attacks on deep neural networks for medical image classification [O]. Hokuto Hirano, Akinori Minagi, Kazuhiro Takemoto. 2021
7. Universal Adversarial Triggers for Attacking and Analyzing NLP [O]. Eric Wallace, Shi Feng, Nikhil Kandpal, et al. 2019
