Conference: International Conference on Pattern Analysis and Intelligent Systems

Detecting Algerian Sub-Dialects of On-Line Commentators in Social Media Networks

Abstract

The amount of textual information written in Romanized Arabic (or Arabizi) is increasing exponentially day by day, so investigating automatic methods to process such texts is becoming a necessity. In this work, we address the identification of Algerian sub-dialects in social media comments written in Romanized Arabic. We also address the Arabizi-French code-switching phenomenon. To the best of our knowledge, this is the first work to address this problem on written documents. Accordingly, we propose a new corpus (the DZDC12 corpus), along with general guidelines for collecting the texts. As a first attempt at Algerian sub-dialect identification, we use two state-of-the-art language identification tools (langid.py and LangDetect), as well as three classifiers (SVM, Multinomial NB, and Gaussian NB) based on a feature selection heuristic. The evaluation conducted on the DZDC12 corpus showed low performance and confirmed our expectation that the problem requires an extensive study to select a reliable feature set.
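
For illustration only, a Multinomial NB classifier of the kind named in the abstract could be sketched as follows. The abstract does not specify the feature set, so the character-trigram features, the smoothing value, the class name `CharNgramNB`, and the toy training strings are all assumptions made here; none of them comes from the authors' setup or from the DZDC12 corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramNB:
    """Multinomial Naive Bayes over character n-gram counts (hypothetical helper)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n          # n-gram length (assumed, not from the paper)
        self.alpha = alpha  # Laplace smoothing constant (assumed)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text, self.n))
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for gram in char_ngrams(text, self.n):
                if gram in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy placeholder strings and labels, NOT real Arabizi and NOT from DZDC12.
train_texts = ["aaa bbb aaa", "bbb aaa aab", "xxx yyy xxx", "yyy xxx xxy"]
train_labels = ["label_a", "label_a", "label_b", "label_b"]

model = CharNgramNB(n=3).fit(train_texts, train_labels)
print(model.predict("aaa aab"))
print(model.predict("xxy yyy"))
```

Character n-grams are chosen here only because they are a common baseline for short, noisily spelled text; the abstract's own conclusion is that choosing a reliable feature set for this task needs further study.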
