No 'Love' Lost: Defending Hate Speech Detection Models Against Adversaries

Abstract

Although current state-of-the-art hate speech detection models achieve praiseworthy results, they have proven vulnerable to attacks. Easy-to-execute lexical evasion schemes, such as removing whitespace from a given text, create significant issues for word-based hate speech detection models. In this paper, we reproduce the results of five cutting-edge models as well as four significant evasion schemes from prior work. These schemes must maintain readability, which enables us to recreate the original data. We present several new defenses that leverage this need for maintained meaning and readability, and these defenses perform on par with or exceed the results of adversarial retraining. Furthermore, we demonstrate that each lexical attack or evasion scheme can be overcome with our new defense mechanisms, with some reducing the effectiveness of the scheme to a mere 0.1 to 0.01 drop in F1 score. We also propose a new evasion scheme that outperforms those in previous work, along with a corresponding defense. Using our results as a foundation, we contend that hate speech detection models can be defended against lexically morphed data without the need for significant retraining. Our work suggests that by utilizing the requirement for preserved meaning, one can create a suitable defense against evasion schemes with a high reversal rate.
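The whitespace-removal scheme mentioned in the abstract, and the idea that readable text can be reversed before classification, can be illustrated with a small sketch. This is not the authors' defense, which the abstract does not specify. It assumes a dictionary-based word segmenter, and the names `VOCAB`, `remove_whitespace`, and `restore_whitespace` are hypothetical.

```python
from functools import lru_cache

# Hypothetical toy vocabulary (an assumption for illustration);
# a practical defense would need a much larger lexicon.
VOCAB = frozenset({"i", "love", "you", "no", "lost", "hate", "speech"})


def remove_whitespace(text: str) -> str:
    """Evasion scheme named in the abstract: strip all whitespace."""
    return "".join(text.split())


def restore_whitespace(text: str, vocab=VOCAB):
    """Split `text` into vocabulary words, preferring longer words.

    Returns the space-joined words, or None if no full segmentation exists.
    The input is lowercased, so the original casing is not recovered.
    """
    text = text.lower()

    @lru_cache(maxsize=None)
    def segment(start: int):
        # Reaching the end of the string means everything was segmented.
        if start == len(text):
            return ()
        # Try the longest candidate word first, then backtrack if needed.
        for end in range(len(text), start, -1):
            word = text[start:end]
            if word in vocab:
                rest = segment(end)
                if rest is not None:
                    return (word,) + rest
        return None

    words = segment(0)
    return None if words is None else " ".join(words)


if __name__ == "__main__":
    attacked = remove_whitespace("no love lost")
    print(attacked)
    print(restore_whitespace(attacked))
```

Under these assumptions, the restored text could be passed to an unmodified word-based classifier, which matches the abstract's point that a reversal step may avoid significant retraining. A string that cannot be fully segmented returns `None`, so a caller would need a fallback for out-of-vocabulary words.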

Bibliographic details

Source
International Conference on Ubiquitous Information Management and Communication | 2020 | pp. 1-6 | 6 pages
Conference location: Taichung (CN)
Authors
Melody Moh; Teng-Sheng Moh; Brian Khieu;
Author affiliation

San José State University Department of Computer Science San José CA USA;

Original format: PDF
Keywords
Voice activity detection; Machine learning; Feature extraction; Data models; Computational modeling; Computer science; Social network services;

Date indexed: 2022-08-26 14:42:36

Similar literature

1. Comparing pre-trained language models for Spanish hate speech detection [J]. Flor Miriam Plaza-del-Arco, M. Dolores Molina-Gonzalez, L. Alfonso Urena-Lopez. Expert Systems with Applications. 2021, Issue Mara
2. Towards Hate Speech Detection at Large via Deep Generative Modeling [J]. Wullach Tomer, Adler Amir, Minkov Einat. IEEE Internet Computing. 2021, Issue 2
3. Hate speech detection and racial bias mitigation in social media based on BERT model [J]. Marzieh Mozafari, Reza Farahbakhsh, Noël Crespi. PLoS One. 2020, Issue 8
4. No "Love" Lost: Defending Hate Speech Detection Models Against Adversaries [C]. Melody Moh, Teng-Sheng Moh, Brian Khieu. International Conference on Ubiquitous Information Management and Communication. 2020
5. On the Detection of Hate Speech, Hate Speakers and Polarized Groups in Online Social Media [D]. Warmsley, Dana. 2017
6. Extended Defense Systems: I. Adversary-Defender Modeling Grammar for Vulnerability Analysis and Threat Assessment [R]. Merkle, P. B. 2006
