Venue: International Conference on Language Resources and Evaluation

A Corpus for Automatic Readability Assessment and Text Simplification of German

Abstract

In this paper, we present a corpus for use in automatic readability assessment and automatic text simplification for German. The corpus is compiled from web sources and consists of parallel as well as monolingual-only (simplified German) data amounting to approximately 6,200 documents (nearly 211,000 sentences). As a unique feature, the corpus contains information on text structure (e.g., paragraphs, lines), typography (e.g., font type, font style), and images (content, position, and dimensions). While the importance of considering such information in machine learning tasks involving simplified language, such as readability assessment, has repeatedly been stressed in the literature, we provide empirical evidence for its benefit. We also demonstrate the added value of leveraging monolingual-only data for automatic text simplification via machine translation through applying back-translation, a data augmentation technique.
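The back-translation idea mentioned above can be illustrated with a minimal sketch. It assumes a reverse model (simplified German to standard German) is available as a plain callable; `toy_reverse_model`, the replacement table, and the sample sentences are hypothetical placeholders, not the system or data described in the paper, whose details this abstract does not give.

```python
from typing import Callable, Iterable, List, Tuple

# (standard German source, simplified German target)
SentencePair = Tuple[str, str]


def back_translate(
    simplified_sentences: Iterable[str],
    reverse_model: Callable[[str], str],
) -> List[SentencePair]:
    """Pair each authentic simplified sentence with a synthetic standard-German source."""
    pairs = []
    for target in simplified_sentences:
        synthetic_source = reverse_model(target)  # simplified -> standard
        pairs.append((synthetic_source, target))
    return pairs


def toy_reverse_model(sentence: str) -> str:
    """Hypothetical stand-in for a trained simplified-to-standard translation model."""
    replacements = {"Arzt": "Mediziner", "Geld": "Kapital"}
    return " ".join(replacements.get(token, token) for token in sentence.split())


# Assumed toy data: one authentic parallel pair plus monolingual simplified text.
parallel = [("Der Mediziner untersucht den Patienten.", "Der Arzt hilft dem Kranken.")]
monolingual_simplified = ["Der Arzt ist da.", "Das Geld ist weg."]

synthetic = back_translate(monolingual_simplified, toy_reverse_model)
augmented = parallel + synthetic  # training set for the standard -> simplified model
```

In this sketch the target side of every synthetic pair is authentic simplified German and only the source side is machine-generated, which is the usual motivation for applying back-translation to monolingual target-language data.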
