Conference: AIRS 2012

Building, Profiling, Analysing and Publishing an Arabic News Corpus Based on Google News RSS Feeds

Abstract

This paper gives a detailed and explicit account of the design, composition and documentation of a new Arabic News Corpus (ArNeCo). We used RSS feeds from Google News as a large container of article titles and crawled the web to extract the article text. About 11,000 documents containing more than 6 million words were tagged as belonging to one of six domains: Business, Entertainment, Health, Science-Technology, Sports, and World. Metadata has been added to the corpus as a whole and to each domain independently. The resulting corpus has been analysed to ensure that it is of considerable quality and size, and it has been published on the Internet for research purposes. This article aims to help potential users of ArNeCo understand the nature of the corpus and carry out information retrieval research with it, for example when formulating queries, justifying decisions taken, or interpreting results. Besides the corpus, the article presents a method for developing corpora that keep track of recent natural language texts posted on the Internet by using RSS feeds.
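
The abstract does not describe the collection pipeline in detail, so the following is only a minimal sketch of the first step it mentions: reading article titles and links from an RSS feed and tagging each entry with one of the six domains. The sample feed, the `parse_feed` and `deduplicate` helpers, and the record fields are all assumptions made for illustration; they are not taken from the paper, and fetching and extracting the article text itself is left out.

```python
import xml.etree.ElementTree as ET

# The six domains named in the abstract.
DOMAINS = ("Business", "Entertainment", "Health",
           "Science-Technology", "Sports", "World")

# Hypothetical sample feed (assumption): the real feed URLs and
# contents are not given in the abstract.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Example article one</title>
      <link>http://example.com/a1</link>
      <pubDate>Mon, 02 Jan 2012 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Example article two</title>
      <link>http://example.com/a2</link>
      <pubDate>Tue, 03 Jan 2012 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_feed(rss_text, domain):
    """Return one record per <item> in an RSS 2.0 feed, tagged with `domain`."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    root = ET.fromstring(rss_text)
    records = []
    for item in root.iter("item"):
        records.append({
            "domain": domain,
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return records


def deduplicate(records):
    """Keep the first record seen for each link, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record["link"] not in seen:
            seen.add(record["link"])
            unique.append(record)
    return unique


if __name__ == "__main__":
    # Parsing the same feed twice simulates polling it repeatedly over time.
    collected = parse_feed(SAMPLE_RSS, "Sports") + parse_feed(SAMPLE_RSS, "Sports")
    for record in deduplicate(collected):
        print(record["domain"], record["title"], record["link"])
```

Deduplicating on the link is one simple way to poll a feed repeatedly without storing the same article twice; a real collector would also need to download each linked page and strip it down to the article text.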