Journal: Information Retrieval

Cross-lingual training of summarization systems using annotated corpora in a foreign language

Abstract

The increasing trend of cross-border globalization and acculturation requires text summarization techniques to work equally well for multiple languages. However, only some of the automated summarization methods can be defined as “language-independent,” i.e., not based on any language-specific knowledge. Such methods can be used for multilingual summarization, defined in Mani (Automatic summarization. Natural language processing. John Benjamins Publishing Company, Amsterdam, 2001) as “processing several languages, with a summary in the same language as input,” but their performance is usually unsatisfactory due to the exclusion of language-specific knowledge. Moreover, supervised machine learning approaches need training corpora in multiple languages, which are usually unavailable for rare languages, and their creation is a very expensive and labor-intensive process. In this article, we describe cross-lingual methods for training an extractive single-document text summarizer called MUSE (MUltilingual Sentence Extractor), a supervised approach based on the linear optimization of a rich set of sentence ranking measures using a Genetic Algorithm. We evaluated MUSE's performance on documents in three different languages (English, Hebrew, and Arabic) using several training scenarios. The summarization quality was measured using ROUGE-1 and ROUGE-2 Recall metrics. The results of the extensive comparative analysis showed that the performance of MUSE was better than that of the best-known multilingual approach (TextRank) in all three languages. Moreover, our experimental results suggest that using the same sentence ranking model across languages results in a reasonable summarization quality while saving considerable annotation effort for the end user. On the other hand, using parallel corpora generated by machine translation tools may improve the performance of a MUSE model trained on a foreign language. A comparative evaluation of an alternative optimization technique, Multiple Linear Regression, justifies the use of a Genetic Algorithm.
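To make the approach concrete, the sketch below illustrates the linear sentence-ranking idea in Python: each sentence receives a score that is a weighted sum of language-independent ranking measures, and the top-scoring sentences form the extractive summary. The feature names, values, and weights are hypothetical placeholders rather than MUSE's actual feature set; in MUSE, the weight vector would be learned by a Genetic Algorithm from an annotated (possibly foreign-language or machine-translated) corpus.

```python
# A minimal sketch of linear sentence ranking for extractive summarization.
# The features and weights below are hypothetical illustrations, not MUSE's
# actual feature set; MUSE learns the weight vector with a Genetic Algorithm
# on an annotated corpus, which may be in a different language than the input.
from typing import Dict, List


def score_sentence(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted linear combination of language-independent ranking measures."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def extract_summary(sentences: List[str],
                    sentence_features: List[Dict[str, float]],
                    weights: Dict[str, float],
                    max_sentences: int = 3) -> List[str]:
    """Rank sentences by score, keep the top ones, and restore document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score_sentence(sentence_features[i], weights),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


if __name__ == "__main__":
    # Toy document with three illustrative, language-independent features per
    # sentence: position in the document, relative length, and term-frequency
    # coverage. All numbers are made up for demonstration purposes.
    sentences = ["First sentence.", "Second sentence.", "Third sentence."]
    features = [
        {"position": 1.0, "length": 0.4, "tf_coverage": 0.7},
        {"position": 0.5, "length": 0.9, "tf_coverage": 0.3},
        {"position": 0.1, "length": 0.6, "tf_coverage": 0.8},
    ]
    weights = {"position": 0.5, "length": 0.2, "tf_coverage": 0.3}  # learned by a GA in MUSE
    print(extract_summary(sentences, features, weights, max_sentences=2))
```

Under this formulation, cross-lingual training simply means learning the weights on a corpus annotated in one language and applying the same ranking model, unchanged, to documents in another language, which is the scenario evaluated in the article.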