
A Fine-Grained Annotated Multi-Dialectal Arabic Corpus




We present ARAP-Tweet 2.0, a corpus of 5 million dialectal Arabic tweets and 50 million words from about 3,000 Twitter users in 17 Arab countries. Compared to the first version, the new corpus is considerably larger and its annotation quality is higher. It is fully balanced with respect to dialect, gender, and three age groups: under 25 years, 25 to 34 years, and 35 years and above. This paper describes how the corpus was created, from gathering the dialectal phrases used to find the users, to annotating their accounts and retrieving their tweets. We also evaluate the annotation quality using inter-annotator agreement measures, which were applied to the whole corpus rather than to a subset. Agreement was high, with average Cohen's kappa values of 0.99, 0.92, and 0.88 for the annotation of gender, dialect, and age, respectively. We also discuss some challenges encountered when developing this corpus.
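For readers unfamiliar with the agreement measure, the following is a minimal sketch of how Cohen's kappa can be computed for two annotators who label the same items. It illustrates the standard formula only and is not the authors' evaluation code; the function name `cohens_kappa` and the toy gender labels are assumptions made for this example and do not come from the corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    all_labels = set(freq_a) | set(freq_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in all_labels) / (n * n)

    # Degenerate case: both annotators used one identical label throughout.
    if p_e == 1.0:
        return 1.0

    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy labels (not taken from ARAP-Tweet 2.0).
annotator_1 = ["F", "F", "M", "M"]
annotator_2 = ["F", "F", "M", "F"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on three of four items (p_o = 0.75) and chance agreement is 0.5, so kappa is 0.5. Because kappa discounts chance agreement, it is more conservative than raw percent agreement.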


