Conference proceedings: Computational linguistics for linguistic complexity

Automatic Construction of Large Readability Corpora


Abstract

This work presents a framework for the automatic construction of large Web corpora classified by readability level. We compare different Machine Learning classifiers for the task of readability assessment, focusing on Portuguese and English texts and analysing how variables such as the feature inventory used affect the resulting corpus. In a comparison between shallow and deeper features, the former already produce F-measures of over 0.75 for Portuguese texts, but adding further features yields even better results in most cases. For English, shallow features also perform well, as do classic readability formulas. Comparing different classifiers for the task, logistic regression obtained, in general, the best results, but with considerable differences between the results for two classes and those for three classes, especially regarding the intermediate class. Given the large scale of the resulting corpus, for evaluation we adopt the agreement between different classifiers as an indication of readability assessment certainty. As a result of this work, a large corpus for Brazilian Portuguese was built, including 1.7 million documents and about 1.6 billion tokens, already parsed and annotated with 134 different textual attributes, along with the agreement among the various classifiers.
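The idea of using classifier agreement as a certainty signal can be illustrated with a minimal sketch. The abstract does not say how agreement is computed, so the majority-vote fraction below is an assumption, and the function name, labels, and document identifiers are hypothetical.

```python
from collections import Counter


def agreement(labels):
    """Return (majority_label, share of classifiers that chose it).

    Assumption: agreement is the fraction of classifiers voting for the
    most common label. On a tie, the label seen first wins.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


# Hypothetical predictions from three classifiers for two documents.
predictions = {
    "doc_a": ["easy", "easy", "easy"],
    "doc_b": ["easy", "hard", "easy"],
}

for doc_id, labels in predictions.items():
    label, score = agreement(labels)
    print(doc_id, label, round(score, 2))
```

Under this assumption, a document on which every classifier agrees gets a score of 1.0, while a split vote gets a lower score and could be treated as a less certain readability assignment.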

Bibliographic record

  • Conference location: Osaka (JP)
  • Author affiliations (the same for all three entries): Institute of Informatics, Federal University of Rio Grande do Sul, Av. Bento Goncalves, 9500, 91501-970, Porto Alegre, RS, Brazil
  • Original format: PDF
  • Language: English
