Crawling by Readability Level

Abstract

The availability of annotated corpora for research in the area of Readability Assessment is still very limited. On the other hand, the Web is increasingly being used by researchers as a source of written content to build very large and rich corpora, as part of the Web as Corpus (WaC) initiative. This paper proposes a framework for the automatic generation of large corpora classified by readability. It adopts a supervised learning method to incorporate a readability filter, based on features with low computational cost, into a crawler, so that the crawler collects texts targeted at a specific reading level. We evaluate this framework by comparing a readability-assessed web-crawled corpus to a reference corpus (both corpora are available at http://www.inf.ufrgs.br/pln/resource/CrawlingByReadabilityLevel.zip). The results obtained indicate that these features are good at separating texts of level 1 (initial grades) from texts of other levels. As a result of this work, two Portuguese corpora were constructed: the Wikilivros Readability Corpus, classified by grade level, and a crawled WaC classified by readability level.
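The idea of a readability filter attached to a crawler can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract names neither the features nor the classifier, so the three shallow features (words per sentence, characters per word, type-token ratio), the nearest-centroid classifier, and all function names below are hypothetical stand-ins rather than the authors' method.

```python
import re
from statistics import mean


def shallow_features(text):
    """Cheap surface features: (words per sentence, chars per word, type-token ratio).

    These three features are illustrative assumptions, not the paper's feature set.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    if not sentences or not words:
        return (0.0, 0.0, 0.0)
    return (
        len(words) / len(sentences),
        mean(len(w) for w in words),
        len(set(words)) / len(words),
    )


def train_centroids(labelled_texts):
    """Supervised step: average the feature vectors of each reading level.

    labelled_texts is a list of (text, level) pairs, e.g. from a graded
    reference corpus. Nearest-centroid is a stand-in for whatever learner
    the framework actually uses.
    """
    by_level = {}
    for text, level in labelled_texts:
        by_level.setdefault(level, []).append(shallow_features(text))
    return {
        level: tuple(mean(column) for column in zip(*vectors))
        for level, vectors in by_level.items()
    }


def predict_level(text, centroids):
    """Assign the level whose centroid is closest in squared Euclidean distance."""
    features = shallow_features(text)
    return min(
        centroids,
        key=lambda level: sum(
            (a - b) ** 2 for a, b in zip(features, centroids[level])
        ),
    )


def readability_filter(pages, centroids, target_level):
    """Yield only the crawled (url, text) pairs predicted to be at target_level."""
    for url, text in pages:
        if predict_level(text, centroids) == target_level:
            yield url, text
```

In a crawler, `readability_filter` would sit between page download and corpus storage, so that only pages predicted to match the target level are kept. The features here are unscaled, which lets sentence length dominate the distance; a real system would normalise them and train on a graded reference corpus.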