Journal: Procedia Computer Science

Automatic Arabic Dialect Classification Using Deep Learning Models

Abstract

Recently, the widespread use of social media and the high availability of internet access have produced textual data that differs considerably from the formal, standard data on the Web. This includes various Arabic dialects, which are the native spoken languages of Arabic speakers. The presence of written dialectal Arabic on the Web has brought many new opportunities as well as challenges for machine learning and Arabic language processing. Identifying this type of informal data is crucial for several applications, such as sentiment analysis and machine translation. However, the standard NLP tools developed for traditional data fall short due to the nature of dialectal textual data. Deep learning tools have proven to be very effective in processing dialectal social media text. In this paper, we consider a variety of deep learning models for the automatic classification of Arabic dialectal text. We use a large, free, manually annotated dataset known as Arabic Online Commentary (AOC), which includes several Dialectal Arabic (DA) varieties along with Modern Standard Arabic (MSA) [3]. We consider the most frequent dialects in the dataset, namely Egyptian (EGP), Levantine (LEV), and Gulf, including Iraqi (GLF). Four different deep neural network models have been implemented to examine the Arabic dialect classification problem for each pair of the three dialects (binary classification experiments), as well as one ternary classification experiment that includes all dialects together. The results show varying but promising performance of the models for each pair of dialects. Furthermore, a closer examination of the manually annotated AOC dataset has been carried out, and we conclude that there is a serious need for a thorough refinement and review of the AOC annotated sentences, as it is an important benchmark dataset in the field.
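
The experimental layout described in the abstract (one binary task per pair of dialects, plus one three-way task) could be enumerated roughly as follows. This is only an illustrative sketch: the function names `build_experiments` and `filter_examples` and the `(text, label)` tuple format are assumptions made here, not details from the paper, and no model architecture is implied.

```python
from itertools import combinations

# Dialect labels as abbreviated in the abstract.
DIALECTS = ("EGP", "LEV", "GLF")


def build_experiments(dialects=DIALECTS):
    """Return the label set of every pairwise task, then the all-dialect task."""
    experiments = list(combinations(dialects, 2))  # binary experiments
    experiments.append(tuple(dialects))            # ternary experiment
    return experiments


def filter_examples(examples, labels):
    """Keep only the (text, label) pairs whose label is part of the experiment.

    `examples` is a hypothetical list of (text, label) tuples.
    """
    wanted = set(labels)
    return [(text, label) for text, label in examples if label in wanted]


if __name__ == "__main__":
    for labels in build_experiments():
        print(f"{len(labels)}-way task:", " vs ".join(labels))
```

Each label set would then select the matching subset of annotated sentences before a classifier is trained on it.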
