
Probing Toxic Content in Large Pre-Trained Language Models



Abstract

Large pre-trained language models (PTLMs) have been shown to carry biases towards different social groups, which leads to the reproduction of stereotypical and toxic content by major NLP systems. We propose a method based on logistic regression classifiers to probe English, French, and Arabic PTLMs and quantify the potentially harmful content that they convey with respect to a set of templates. Each template is prompted by the name of a social group followed by a cause-effect relation. We use PTLMs to predict masked tokens at the end of a sentence in order to examine how likely they are to enable toxicity towards specific communities. We shed light on how such negative content can be triggered within unrelated and benign contexts, drawing on evidence from a large-scale study, and then explain how our methodology can be used to assess and mitigate the toxicity transmitted by PTLMs.
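
To make the probing setup more concrete, below is a minimal sketch of template-based masked-token prediction using the HuggingFace transformers fill-mask pipeline. The model name, template string, and group names are illustrative assumptions for this sketch, not the paper's actual prompt set, and the toxicity scoring step (the logistic regression classifier) is only indicated in a comment.

# Minimal sketch: probe a masked language model with a templated prompt.
# Assumptions: bert-base-uncased as the PTLM; placeholder template and
# group names; the paper's own templates and classifier are not reproduced here.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Hypothetical template: a social-group name followed by a cause-effect
# relation, with the final token masked for the PTLM to complete.
groups = ["immigrants", "teenagers"]  # placeholder group names
template = "Because they are {group}, people think they are [MASK]."

for group in groups:
    prompt = template.format(group=group)
    # Top predictions for the masked slot; in the paper's setting these
    # completions would then be scored for toxicity, e.g. with a separately
    # trained logistic regression classifier.
    for pred in fill_mask(prompt, top_k=5):
        print(group, pred["token_str"], round(pred["score"], 4))
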
