Combining n-grams and deep convolutional features for language variety classification

Martinc Matej; Pollak Senja

Abstract

This paper presents a novel neural architecture capable of outperforming state-of-the-art systems on the task of language variety classification. The architecture is a hybrid that combines character-based convolutional neural network (CNN) features with weighted bag-of-n-grams (BON) features and is therefore capable of leveraging both character-level and document/corpus-level information. We tested the system on the Discriminating between Similar Languages (DSL) language variety benchmark data set from the VarDial 2017 DSL shared task, which contains data from six different language groups, as well as on two smaller data sets (the Arabic Dialect Identification (ADI) Corpus and the German Dialect Identification (GDI) Corpus, from the VarDial 2016 ADI and VarDial 2018 GDI shared tasks, respectively). We managed to outperform the winning system in the DSL shared task by a margin of about 0.4 percentage points and the winning system in the ADI shared task by a margin of about 0.2 percentage points in terms of weighted F1 score without conducting any language group-specific parameter tweaking. An ablation study suggests that weighted BON features contribute more to the overall performance of the system than the CNN-based features, which partially explains the uncompetitiveness of deep learning approaches in the past VarDial DSL shared tasks. Finally, we have implemented our system in a workflow, available in the ClowdFlows platform, in order to make it easily available also to the non-programming members of the research community.
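The results above are reported as weighted F1 scores. As a minimal sketch of how that metric is computed (per-class F1 averaged with weights proportional to each class's number of true instances), the following standalone function may help. The function name `weighted_f1` and the labels in the usage note are hypothetical illustrations, not taken from the paper or from the shared-task evaluation scripts.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative sketch)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")

    support = Counter(y_true)    # true instances per class
    predicted = Counter(y_pred)  # predictions per class
    # Correct predictions per class (true positives)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to its support
        score += (n_true / total) * f1
    return score
```

For example, with hypothetical gold labels `["bs", "bs", "hr", "hr", "sr", "sr"]` and predictions `["bs", "hr", "hr", "hr", "sr", "bs"]`, the per-class F1 scores are 0.5, 0.8 and 2/3, and because every class has the same support the weighted F1 is their plain mean.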

Bibliographic details

Source
Natural Language Engineering, 2019, Issue 5, pp. 607-632 (26 pages)
Authors
Martinc Matej; Pollak Senja
Author affiliations

Jozef Stefan Inst Dept Knowledge Technol Ljubljana Slovenia;

Jozef Stefan Inst Dept Knowledge Technol Ljubljana Slovenia | Univ Edinburgh Usher Inst Edinburgh Med Sch Usher Inst Populat Hlth Sci & Informat Edinburgh Midlothian Scotland;

Original format: PDF
Language: English
Keywords
language variety; author profiling; text classification; convolutional neural network; bag-of-n-grams
