Conference: International Conference on Information Retrieval and Knowledge Management

Building the Classical Arabic Named Entity Recognition Corpus (CANERCorpus)

Abstract

The past decade has witnessed the construction of background information resources to overcome several challenges in text mining tasks. For non-English languages with poor knowledge sources, such as Arabic, these challenges have become more salient, especially for natural language processing applications that require human annotation. In the Named Entity Recognition (NER) task, several studies have addressed the complexity of Arabic in terms of morphological and syntactic variation. However, only a small number of studies deal with Classical Arabic (CA), the official language of the Quran and Hadith. CA was also used for archiving Islamic topics, which contain a lot of useful information that could be of great value if extracted. Therefore, in this paper, we introduce the Classical Arabic Named Entity Recognition corpus (CANERCorpus), a new corpus of tagged data that can be useful for handling issues in the recognition of Arabic named entities. It is freely available and manually annotated by human experts, and it contains more than 7,000 Hadiths. Based on Islamic topics, we classify named entities into 20 types, including domain-specific entities that have not been handled before, such as Allah, Prophet, Paradise, Hell, and Religion. The differences between Standard and Classical Arabic are described in detail in this work. Moreover, a comprehensive statistical analysis is introduced to measure the factors that play an important role in manual human annotation.
