A Fine-Grained Annotated Multi-Dialectal Arabic Corpus


Abstract

We present ARAP-Tweet 2.0, a corpus of 5 million dialectal Arabic tweets and 50 million words from about 3000 Twitter users in 17 Arab countries. Compared to the first version, the new corpus improves significantly in data volume and annotation quality. It is fully balanced with respect to dialect, gender, and three age groups: under 25 years, between 25 and 34, and 35 years and above. This paper describes the process of creating the corpus, from gathering the dialectal phrases used to find the users, to annotating their accounts and retrieving their tweets. We also report on the evaluation of the annotation quality using inter-annotator agreement measures, which were applied to the whole corpus and not just a subset. The obtained results were substantial, with average Cohen's Kappa values of 0.99, 0.92, and 0.88 for the annotation of gender, dialect, and age, respectively. We also discuss some challenges encountered when developing this corpus.
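The agreement figures in the abstract are Cohen's Kappa values, which correct the raw agreement between two annotators for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that standard formula, not the authors' evaluation code; the function name `cohens_kappa` and the toy gender labels are assumptions made purely for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items.

    labels_a and labels_b are equal-length sequences of category labels.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two
    # annotators' marginal proportions, summed over categories.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    # Degenerate case: both annotators used one identical label throughout.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Toy example with made-up gender labels for four accounts.
annotator_1 = ["F", "M", "F", "M"]
annotator_2 = ["F", "M", "M", "M"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

In the toy example the annotators agree on 3 of 4 accounts (p_o = 0.75) and chance agreement is p_e = 0.5, so kappa is 0.5. The same function would apply unchanged to dialect or age-group labels, since it only compares category labels.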


Bibliographic record

Source
International Conference on Recent Advances in Natural Language Processing | 2019 | pp. 198-204 | 7 pages
Authors
Anis Charfi; Wajdi Zaghouani; Syed Hassan Mehdi; Esraa Mohamed
Original format: PDF

Similar literature


1. A Morphologically Annotated Corpus and a Morphological Analyzer for Egyptian Arabic [J]. Amany Fashwan, Sameh Alansary. Procedia Computer Science, 2021.
2. A continuous vocoder for statistical parametric speech synthesis and its evaluation using an audio-visual phonetically annotated Arabic corpus [J]. Mohammed Salah Al-Radhi, Omnia Abdo, Tamas Gabor Csapo. Computer Speech and Language, 2020.
3. A set of parameters for automatically annotating a Sentiment Arabic Corpus [J]. Guellil Imane, Darwish Kareem, Azouaou Faical. International Journal of Web Information Systems, 2019, no. 5.
4. A Fine-Grained Annotated Multi-Dialectal Arabic Corpus [C]. Anis Charfi, Wajdi Zaghouani, Syed Hassan Mehdi, Esraa Mohamed. International Conference on Recent Advances in Natural Language Processing, 2019.
5. Annotating a corpus of biomedical research texts: Two models of rhetorical analysis [D]. White, Barbara Ellen. 2010.
6. GNI Corpus Version 1.0: Annotated Full-Text Corpus of Genomics Informatics to Support Biomedical Information Extraction [O]. So-Yeon Oh, Ji-Hyeon Kim, Seo-Jin Kim. 2018.
7. Using Twitter to collect a multi-dialectal corpus of Arabic [O]. Hamdy Mubarak, Kareem Darwish. 2014.
