Topics to Avoid: Demoting Latent Confounds in Text Classification

Abstract

Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topical features which are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author's native language is Swedish). We propose a method that represents the latent topical confounds and a model which "unlearns" confounding features by predicting both the label of the input text and the confound; but we train the two predictors adversarially in an alternating fashion to learn a text representation that predicts the correct label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.
