Venue: Workshop on NLP for Similar Languages, Varieties and Dialects

QCRI @ DSL 2016: Spoken Arabic Dialect Identification Using Textual Features


Abstract

This paper describes the QCRI submissions to the shared task on automatic Arabic dialect classification into five Arabic variants: Egyptian, Gulf, Levantine, North-African (Maghrebi), and Modern Standard Arabic (MSA). The relatively small training set is automatically generated by an ASR system. To avoid over-fitting on such a small dataset, we selected and designed features that capture the morphological essence of the different dialects. We submitted four runs to the Arabic sub-task. For all runs, we used a combined feature vector of character bigrams, trigrams, 4-grams, and 5-grams. We tried several machine-learning algorithms, namely Logistic Regression, Naive Bayes, Neural Networks, and Support Vector Machines (SVM) with linear and string kernels. Our submitted runs used an SVM with a linear kernel. In the closed submission, we achieved the best accuracy (0.5136) and the third-best weighted F1 score, which was less than 0.002 below that of the best system.
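
As an illustration of the combined character n-gram feature vector mentioned in the abstract, the following is a minimal sketch using only the Python standard library. The function names (`char_ngrams`, `build_vocabulary`, `vectorize`), the use of raw counts, and the toy input strings are assumptions made for this example; the abstract does not specify the weighting or preprocessing used in the submitted runs.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """Count every character n-gram of length n_min..n_max in `text`.

    Assumption: raw counts over the unmodified string; the original
    system may have used a different weighting or normalization.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def build_vocabulary(texts, n_min=2, n_max=5):
    """Map each n-gram seen in the training texts to a column index."""
    vocab = {}
    for text in texts:
        for gram in char_ngrams(text, n_min, n_max):
            vocab.setdefault(gram, len(vocab))
    return vocab


def vectorize(text, vocab, n_min=2, n_max=5):
    """Turn `text` into one combined count vector over the vocabulary.

    N-grams that are not in the vocabulary are ignored.
    """
    vec = [0] * len(vocab)
    for gram, count in char_ngrams(text, n_min, n_max).items():
        idx = vocab.get(gram)
        if idx is not None:
            vec[idx] = count
    return vec


if __name__ == "__main__":
    # Toy placeholder strings, not data from the shared task.
    train = ["abcab", "bcabc"]
    vocab = build_vocabulary(train)
    print(len(vocab), vectorize("abcab", vocab))
```

Vectors of this kind could then be passed to any linear classifier, such as the linear-kernel SVM named in the abstract. Bigrams through 5-grams share a single vocabulary here, so each utterance yields one combined vector rather than four separate ones.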
