Journal: Natural Language Engineering

Combining n-grams and deep convolutional features for language variety classification


Abstract

This paper presents a novel neural architecture capable of outperforming state-of-the-art systems on the task of language variety classification. The architecture is a hybrid that combines character-based convolutional neural network (CNN) features with weighted bag-of-n-grams (BON) features and is therefore capable of leveraging both character-level and document/corpus-level information. We tested the system on the Discriminating between Similar Languages (DSL) language variety benchmark data set from the VarDial 2017 DSL shared task, which contains data from six different language groups, as well as on two smaller data sets (the Arabic Dialect Identification (ADI) Corpus and the German Dialect Identification (GDI) Corpus, from the VarDial 2016 ADI and VarDial 2018 GDI shared tasks, respectively). In terms of weighted F1 score, we outperformed the winning system in the DSL shared task by a margin of about 0.4 percentage points and the winning system in the ADI shared task by a margin of about 0.2 percentage points, without conducting any language group-specific parameter tweaking. An ablation study suggests that weighted BON features contribute more to the overall performance of the system than the CNN-based features, which partially explains the uncompetitiveness of deep learning approaches in the past VarDial DSL shared tasks. Finally, we have implemented our system as a workflow in the ClowdFlows platform to make it easily accessible to non-programming members of the research community as well.
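
The hybrid idea in the abstract can be illustrated with a minimal sketch: one feature vector is built by concatenating max-pooled character-convolution features with a weighted bag of character n-grams. Everything specific here is an assumption made for illustration and not taken from the paper: the function names, the n-gram range, the smoothed TF-IDF weighting, the embedding size, the filter width, and the random (untrained) weights. The classifier that would sit on top of the concatenated vector, and all training, are omitted.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max (range is an assumption)."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def fit_idf(docs):
    """Smoothed inverse document frequency per n-gram (one common variant, assumed here)."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc)))
    n_docs = len(docs)
    return {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}


def bon_features(text, idf, vocab):
    """Weighted bag-of-n-grams vector: term frequency times corpus-level IDF."""
    tf = Counter(char_ngrams(text))
    return [tf[g] * idf[g] for g in vocab]


def cnn_features(text, embed, filters, width=3):
    """1D convolution over character embeddings, ReLU, then max-over-time pooling."""
    dim = len(next(iter(embed.values())))
    unk = [0.0] * dim
    seq = [embed.get(ch, unk) for ch in text]
    while len(seq) < width:          # pad short inputs so one window exists
        seq.append(unk)
    pooled = []
    for f in filters:                # f is a flat list of width * dim weights
        best = 0.0                   # starting at 0 applies the ReLU floor
        for i in range(len(seq) - width + 1):
            window = [x for vec in seq[i:i + width] for x in vec]
            best = max(best, sum(w * x for w, x in zip(f, window)))
        pooled.append(best)
    return pooled


def hybrid_features(text, idf, vocab, embed, filters):
    """Concatenate character-level CNN features with corpus-level BON features."""
    return cnn_features(text, embed, filters) + bon_features(text, idf, vocab)


# Toy corpus and random, untrained parameters (illustrative only).
docs = ["o gato dorme no telhado", "el gato duerme en el tejado"]
rng = random.Random(0)
idf = fit_idf(docs)
vocab = sorted(idf)
chars = sorted(set("".join(docs)))
embed = {ch: [rng.uniform(-1, 1) for _ in range(4)] for ch in chars}
filters = [[rng.uniform(-1, 1) for _ in range(3 * 4)] for _ in range(5)]

vec = hybrid_features(docs[0], idf, vocab, embed, filters)
print(len(vec), "features:", len(filters), "CNN +", len(vocab), "BON")
```

In a trained system the convolution filters and embeddings would be learned jointly with the layers that consume the concatenated vector; the sketch only shows how the two kinds of information end up side by side in a single representation.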
