Conference: International Conference on Asian Language Processing

New Language Resources for Arabic: Corpus Containing More Than Two Million Words and a Corpus Processing Tool


Abstract

Arabic is a resource-poor language relative to other languages with a similar number of speakers. This situation negatively affects corpus-based linguistic studies in Arabic and, to a lesser extent, Arabic language processing. This paper presents a brief overview of recent freely available Arabic corpora and corpus processing tools, and it examines some of the issues that may be preventing Arabic linguists from using them. These issues reveal the need for new language resources to enrich and foster Arabic corpus-based studies. Accordingly, this paper introduces the design of a new Arabic corpus that covers Modern Standard Arabic varieties, is based on newspapers from all Arab countries, and comprises more than two million words. It also describes the main features of a corpus processing tool specifically designed for Arabic, called "Khawas غواص" ("diver" in English). Khawas provides more features than any other freely available corpus processing tool for Arabic, including n-gram frequency and concordance, collocations, and statistical comparison of two corpora. Finally, we outline modifications and improvements that could be made in future work.
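As a rough illustration of two of the features named in the abstract, n-gram frequency and concordance, the following is a minimal Python sketch. It is not the Khawas implementation, which the abstract does not describe; the function names `ngram_counts` and `concordance`, the whitespace tokenization, and the English toy sample are all assumptions made for this example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for every exact match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits


# Hypothetical toy sample; whitespace splitting is a simplification and
# real Arabic text would need proper tokenization and normalization.
sample = "the diver found a pearl and the diver kept the pearl"
tokens = sample.split()

bigrams = ngram_counts(tokens, 2)
kwic = concordance(tokens, "diver")

for left, word, right in kwic:
    print(f"{left:>10} [{word}] {right}")
```

A real tool would likely add frequency thresholds, sorting of concordance lines, and association measures for collocations; none of those are shown here.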