European Conference on IR Research (ECIR 2005); March 21-23, 2005; Santiago de Compostela (ES)

Automatic Text Summarization Based on Word-Clusters and Ranking Algorithms


Abstract

This paper investigates a new approach to Single Document Summarization (SDS) based on a machine learning ranking algorithm. Using machine learning techniques for this task makes it possible to adapt summaries to user needs and to corpus characteristics. These desirable properties have motivated an increasing amount of work in this field over the last few years. Most approaches generate summaries by extracting text spans (sentences in our case) and adopt the classification framework, which consists of training a classifier to discriminate between relevant and irrelevant spans of a document. A set of features is first used to produce a vector of scores for each sentence in a given document, and a classifier is trained to make a global combination of these scores. We believe that the classification criterion used to train a classifier is not well suited to SDS, and we propose an original framework based on ranking for this task. A ranking algorithm also combines the scores of different features, but its criterion tends to reduce the relative misordering of sentences within a document. The features we use here are either based on the state of the art or built upon word-clusters. These clusters are groups of words that often co-occur with each other, and they can serve to expand a query or to enrich the representation of the sentences of the documents. We analyze the performance of our ranking algorithm on two data sets: the Computation and Language (cmp_lg) collection of TIPSTER SUMMAC and the WIPO collection. We compare it with several non-learning baseline systems and with a reference trainable summarizer based on the classification framework. The experiments show that the learning algorithms perform better than the non-learning systems, and that the ranking algorithm outperforms the classifier. The difference in performance between the two learning algorithms depends on the nature of the data sets. We explain this fact by the different separability hypotheses that the two learning algorithms make about the data.
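
The ranking criterion described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss function, the optimizer, or the features, so everything below is an assumption chosen for illustration: a linear scorer, a pairwise logistic loss minimized by plain gradient descent, and two made-up feature scores per sentence. The names `score`, `misordered_pairs`, and `train_ranker` are hypothetical and do not come from the paper.

```python
import math

def score(w, x):
    """Linear combination of one sentence's feature scores."""
    return sum(wi * xi for wi, xi in zip(w, x))

def split(sentences):
    """Separate a document's sentences into relevant and irrelevant vectors."""
    pos = [x for x, y in sentences if y == 1]
    neg = [x for x, y in sentences if y == 0]
    return pos, neg

def misordered_pairs(w, docs):
    """Count within-document (relevant, irrelevant) pairs in which the
    irrelevant sentence scores at least as high as the relevant one."""
    errors = 0
    for sentences in docs:
        pos, neg = split(sentences)
        errors += sum(score(w, p) <= score(w, n) for p in pos for n in neg)
    return errors

def train_ranker(docs, n_features, lr=0.1, epochs=200):
    """Gradient descent on an assumed pairwise logistic loss,
    log(1 + exp(-(s(p) - s(n)))), summed over within-document pairs."""
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for sentences in docs:
            pos, neg = split(sentences)
            for p in pos:
                for n in neg:
                    margin = score(w, p) - score(w, n)
                    # Derivative of the loss with respect to the margin;
                    # the margin is capped only to avoid overflow in exp.
                    g = -1.0 / (1.0 + math.exp(min(margin, 50.0)))
                    for k in range(n_features):
                        grad[k] += g * (p[k] - n[k])
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w

# Toy data (invented): each sentence is (feature_scores, label),
# with label 1 for a sentence that belongs in the summary.
docs = [
    [([0.9, 0.8], 1), ([0.5, 0.4], 0), ([0.6, 0.3], 0)],
    [([0.4, 0.3], 1), ([0.1, 0.0], 0), ([0.2, 0.1], 0)],
]

w = train_ranker(docs, n_features=2)
print("misordered pairs before training:", misordered_pairs([0.0, 0.0], docs))
print("misordered pairs after training:", misordered_pairs(w, docs))
```

The toy data are built so that the relevant sentence of the second document scores lower on both features than an irrelevant sentence of the first document, while every relevant sentence still dominates the irrelevant sentences of its own document. A criterion that only compares sentences within a document can therefore be satisfied even when a single global decision boundary is hard to find, which is one way to read the abstract's remark about the differing separability hypotheses.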
