International Conference on Knowledge and Systems Engineering

From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection



Abstract

Natural language processing (NLP) is a fast-growing field of artificial intelligence. Since Google introduced the Transformer [32] in 2017, this architecture has inspired a large number of language models such as BERT, GPT, and ELMo. These models were trained on huge datasets and achieved state-of-the-art results on natural language understanding tasks. However, fine-tuning a pre-trained language model on much smaller datasets for downstream tasks requires a carefully designed pipeline to mitigate dataset problems such as scarce training data and class imbalance. In this paper, we propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection. We first tune PhoBERT [9] on our dataset by re-training the model on the Masked Language Model (MLM) task; then, we employ its encoder for text classification. In order to preserve pre-trained weights while learning new feature representations, we further utilize different training techniques: layer freezing, block-wise learning rate, and label smoothing. Our experiments show that the proposed pipeline boosts performance significantly, achieving a new state of the art on the Vietnamese Hate Speech Detection (HSD) campaign with an F1 score of 0.7221.
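The first step the abstract describes, re-training PhoBERT on the task corpus with the MLM objective, can be sketched with the HuggingFace Transformers library. This is a minimal illustration of standard masked-language-model domain adaptation, not the authors' training script: the public `vinai/phobert-base` checkpoint, the corpus file name `hsd_corpus.txt`, and all hyperparameters below are assumptions.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# PhoBERT's public base checkpoint; the paper's exact starting
# weights may differ.
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForMaskedLM.from_pretrained("vinai/phobert-base")

# One (word-segmented) comment per line; "hsd_corpus.txt" is a
# hypothetical filename standing in for the task corpus.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="hsd_corpus.txt", block_size=256
)

# Standard MLM objective: randomly mask 15% of the tokens.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phobert-hsd-mlm", num_train_epochs=3),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("phobert-hsd-mlm")
```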
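The three fine-tuning techniques the abstract names can likewise be sketched in PyTorch. Everything below is illustrative: the number of frozen blocks, the learning rates, the decay factor, and the smoothing value are assumptions rather than the paper's reported hyperparameters; `phobert-hsd-mlm` refers to the checkpoint saved in the previous sketch, and `num_labels=3` assumes the CLEAN / OFFENSIVE / HATE label set of the VLSP HSD shared task.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Load the MLM-adapted encoder with a fresh classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "phobert-hsd-mlm", num_labels=3
)

# 1) Layer freezing: fix the embeddings and the lowest encoder blocks
#    so their pre-trained weights are preserved (4 blocks is an
#    illustrative choice).
for param in model.roberta.embeddings.parameters():
    param.requires_grad = False
for block in model.roberta.encoder.layer[:4]:
    for param in block.parameters():
        param.requires_grad = False

# 2) Block-wise learning rate: each encoder block gets a geometrically
#    smaller learning rate the closer it sits to the input.
base_lr, decay = 2e-5, 0.9
layers = model.roberta.encoder.layer
param_groups = []
for i, block in enumerate(layers):
    trainable = [p for p in block.parameters() if p.requires_grad]
    if trainable:  # skip the fully frozen blocks
        param_groups.append(
            {"params": trainable, "lr": base_lr * decay ** (len(layers) - 1 - i)}
        )
param_groups.append({"params": model.classifier.parameters(), "lr": base_lr})
optimizer = torch.optim.AdamW(param_groups)

# 3) Label smoothing: soften the one-hot targets to regularize the
#    classifier (0.1 is a common default, not the paper's value).
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```

Filtering on `requires_grad` keeps the frozen parameters out of the optimizer, and the geometric decay gives the blocks nearest the input the smallest updates, which is the usual rationale for block-wise (discriminative) learning rates.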
