Journal: Expert Systems with Applications

SummCoder: An unsupervised framework for extractive text summarization based on deep auto-encoders

Abstract

In this paper, we propose SummCoder, a novel methodology for generic extractive text summarization of single documents. The approach generates a summary according to three sentence selection metrics that we formulate: sentence content relevance, sentence novelty, and sentence position relevance. Sentence content relevance is measured using a deep auto-encoder network, and the novelty metric is derived by exploiting the similarity among sentences represented as embeddings in a distributed semantic space. The sentence position relevance metric is a hand-designed feature that assigns more weight to the first few sentences through a dynamic weight calculation function regulated by the document length. Furthermore, we develop a sentence ranking and selection technique that generates the document summary by ranking the sentences according to the final score obtained by fusing the three sentence selection metrics. We also introduce a new summarization benchmark, the Tor Illegal Documents Summarization (TIDSumm) dataset, intended especially to assist Law Enforcement Agencies (LEAs). It contains two sets of manually created ground truth summaries for 100 web documents extracted from onion websites in the Tor (The Onion Router) network. Empirical results show that, on the DUC 2002, Blog Summarization, and TIDSumm datasets, our text summarization approach obtains comparable or better performance than the state-of-the-art methods for different ROUGE metrics. © 2019 Elsevier Ltd. All rights reserved.
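
The abstract names the three metrics and their fusion but gives no formulas, so the following is only a rough Python sketch of how such a pipeline might be wired together. Every function name and formula here is an assumption made for illustration: the exponential position decay, the "one minus maximum cosine similarity" novelty score, and the product fusion are not taken from the paper. The content relevance scores, which the paper obtains from a deep auto-encoder, are simply passed in as numbers.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def position_scores(n):
    """Hypothetical position weights: earlier sentences score higher.

    The decay scale grows with the number of sentences n, so the weight
    falls off more slowly in longer documents (an assumed formula).
    """
    scale = max(1.0, math.sqrt(n))
    return [math.exp(-i / scale) for i in range(n)]


def novelty_scores(embeddings, relevance):
    """Hypothetical novelty: 1 - max similarity to any more relevant sentence.

    Ties in relevance are broken in favor of the earlier sentence.
    """
    scores = []
    for i, emb in enumerate(embeddings):
        sims = [
            cosine(emb, other)
            for j, other in enumerate(embeddings)
            if (relevance[j], -j) > (relevance[i], -i)
        ]
        scores.append(1.0 - max(0.0, max(sims, default=0.0)))
    return scores


def select_summary(embeddings, relevance, k):
    """Fuse the three scores by product (an assumption) and pick the top k.

    Returns the chosen sentence indices in document order.
    """
    n = len(embeddings)
    pos = position_scores(n)
    nov = novelty_scores(embeddings, relevance)
    final = [relevance[i] * nov[i] * pos[i] for i in range(n)]
    ranked = sorted(range(n), key=lambda i: (-final[i], i))
    return sorted(ranked[:k])


# Toy input: sentence 1 duplicates sentence 0, sentence 2 is unrelated.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_relevance = [0.9, 0.8, 0.5]  # stand-in for auto-encoder relevance scores
chosen = select_summary(toy_embeddings, toy_relevance, k=2)
print(chosen)
```

On the toy input, the duplicate sentence receives a novelty of zero under this scheme, so the less relevant but non-redundant third sentence is selected instead. A weighted sum would be an equally plausible fusion rule; the product is used here only because it lets a zero novelty score veto a redundant sentence.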
