Journal: Expert Systems with Applications

Improving lexical coverage of text simplification systems for Spanish



Abstract

The current bottleneck of all data-driven lexical simplification (LS) systems is the scarcity and small size of the parallel corpora (original sentences and their manually simplified versions) used for training. This is especially pronounced for languages other than English. We address this problem, taking Spanish as an example of such a language, by building new simplification-specific datasets of synonyms and paraphrases from freely available resources. We test their usefulness in the LS task by adding them, in various combinations, to the existing text simplification (TS) training dataset in a phrase-based statistical machine translation (PBSMT) approach. Our best systems significantly outperform the state-of-the-art LS systems for Spanish in the number of transformations performed and in the grammaticality, simplicity, and meaning preservation of the output sentences. A detailed manual analysis shows that some of the newly built TS resources, although they have good lexical coverage and lead to a high number of transformations, often change the original meaning and do not generate simpler output when used in this PBSMT setup. In contrast, good combinations of these additional resources with the TS training dataset, together with a good choice of language model, improve the lexical coverage and produce sentences that are grammatical, simpler than the original, and faithful to the original meaning. (C) 2018 Elsevier Ltd. All rights reserved.
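The idea of adding synonym and paraphrase pairs to a TS parallel corpus to widen lexical coverage can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`source_vocabulary`, `lexical_coverage`, `augment`), the whitespace tokenization, the coverage definition (share of distinct test tokens seen on the source side of the training pairs), and all the toy Spanish data are assumptions made purely for illustration.

```python
from typing import Iterable, List, Set, Tuple

# A training pair: (original text, simplified text). Hypothetical type alias.
Pair = Tuple[str, str]


def source_vocabulary(pairs: Iterable[Pair]) -> Set[str]:
    """Collect the lowercase tokens seen on the source (original) side."""
    vocab: Set[str] = set()
    for source, _target in pairs:
        vocab.update(source.lower().split())
    return vocab


def lexical_coverage(pairs: Iterable[Pair], sentences: Iterable[str]) -> float:
    """Share of distinct tokens in `sentences` that occur on the source side of `pairs`."""
    vocab = source_vocabulary(pairs)
    tokens = {tok for sent in sentences for tok in sent.lower().split()}
    if not tokens:
        return 0.0
    return len(tokens & vocab) / len(tokens)


def augment(base: List[Pair], *extras: List[Pair]) -> List[Pair]:
    """Append extra pairs to the base corpus, dropping exact duplicates, keeping order."""
    seen: Set[Pair] = set()
    combined: List[Pair] = []
    for pair in [*base, *(p for extra in extras for p in extra)]:
        if pair not in seen:
            seen.add(pair)
            combined.append(pair)
    return combined


# Toy data, invented for this sketch (not taken from the paper's datasets).
ts_corpus: List[Pair] = [
    ("el médico prescribió un fármaco", "el médico recetó una medicina"),
]
synonym_pairs: List[Pair] = [("vivienda", "casa"), ("fármaco", "medicina")]
paraphrase_pairs: List[Pair] = [("dar comienzo", "empezar")]
test_sentences = ["el médico visitó la vivienda"]

augmented = augment(ts_corpus, synonym_pairs, paraphrase_pairs)

if __name__ == "__main__":
    print("coverage, TS corpus only:", lexical_coverage(ts_corpus, test_sentences))
    print("coverage, augmented:     ", lexical_coverage(augmented, test_sentences))
```

In this toy setup the augmented corpus covers more of the test vocabulary than the TS corpus alone. Higher coverage by itself says nothing about output quality, which matches the abstract's caution that some resources increase the number of transformations while changing the meaning.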
