This thesis is about automatic document summarization, with experimental results on generic, query-focused, update and comparative multi-document summarization (MDS). We describe prior work and our own improvements on some important aspects of a summarization system, including text modeling by means of a graph and sentence selection via archetypal analysis. The centerpiece of this work is a novel method for summarization that we call “Archetypal Analysis Summarization”.

Archetypal Analysis (AA) is a promising unsupervised learning tool that combines the advantages of clustering with the flexibility of matrix factorization. We propose a novel AA-based summarization method based on the following observations. In generic document summarization, given a graph representation of a set of documents, positively and/or negatively salient sentences are values on the data set boundary. To compute these extreme values, which are general or weighted archetypes, we use archetypal analysis and weighted archetypal analysis, respectively. While each sentence in a data set is estimated as a mixture of archetypal sentences, the archetypes themselves are restricted to being sparse mixtures, i.e. convex combinations, of the original sentences. Since AA in this way readily offers soft clustering and probabilistic ranking, we suggest considering it as a method for simultaneous sentence clustering and ranking. Another important argument in favour of using AA in MDS is that, in contrast to other factorization methods, which extract prototypical, characteristic, or even basic sentences, AA selects distinct (archetypal) sentences and thus induces variability and diversity in the produced summaries. Our research contributes new modeling approaches based on graph notation that facilitate the text summarization task.
We investigate the impact of using the content-graph and multi-element graph models for language- and domain-independent extractive multi-document generic and query-focused summarization. We also propose a novel version of AA, the weighted Hierarchical Archetypal Analysis, and consider its use for the four best-known summarization tasks: generic, query-focused, update, and comparative summarization. Experiments on summarization data sets (DUC04-07, TAC08) are conducted to demonstrate the efficiency and effectiveness of our framework for all four kinds of multi-document summarization task.
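
As a rough illustration of the archetypal analysis idea described above, the following Python sketch factorizes a small sentence matrix `X` as `X ≈ A B X`, where the rows of `A` and `B` lie on the probability simplex, so the archetypes `B X` are convex combinations of the original sentences and each sentence is a mixture of archetypes. This is not the implementation used in the thesis: the projected-gradient solver, the toy matrix, the function names, and the rule of picking the highest-weighted sentence per archetype are all assumptions made for illustration, and the weighted and hierarchical variants are not covered.

```python
import numpy as np


def project_rows_to_simplex(M):
    """Euclidean projection of every row of M onto the probability simplex."""
    n, k = M.shape
    u = -np.sort(-M, axis=1)                  # each row sorted in descending order
    css = np.cumsum(u, axis=1) - 1.0
    j = np.arange(1, k + 1)
    rho = (u - css / j > 0).sum(axis=1)       # support size per row, always >= 1
    theta = css[np.arange(n), rho - 1] / rho
    return np.maximum(M - theta[:, None], 0.0)


def archetypal_analysis(X, k, n_iter=300, seed=0):
    """Illustrative solver: alternate projected-gradient steps on A and B
    to reduce 0.5 * ||A B X - X||_F^2 with simplex-constrained rows."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = project_rows_to_simplex(rng.random((n, k)))   # sentence -> archetype weights
    B = project_rows_to_simplex(rng.random((k, n)))   # archetype -> sentence weights
    gram_norm = np.linalg.norm(X @ X.T, 2)
    errors = []
    for _ in range(n_iter):
        Z = B @ X                                     # current archetypes (k x d)
        R = A @ Z - X
        errors.append(0.5 * np.sum(R ** 2))
        step = 1.0 / (np.linalg.norm(Z @ Z.T, 2) + 1e-12)
        A = project_rows_to_simplex(A - step * (R @ Z.T))
        R = A @ Z - X
        step = 1.0 / (np.linalg.norm(A.T @ A, 2) * gram_norm + 1e-12)
        B = project_rows_to_simplex(B - step * (A.T @ R @ X.T))
    errors.append(0.5 * np.sum((A @ B @ X - X) ** 2))
    return A, B, errors


# Hypothetical toy input: 6 sentences described by 3 made-up term weights.
X = np.array([
    [2.0, 0.0, 0.0],
    [1.8, 0.2, 0.0],
    [0.0, 2.0, 0.0],
    [0.1, 1.9, 0.0],
    [0.0, 0.0, 2.0],
    [0.7, 0.7, 0.6],
])

A, B, errors = archetypal_analysis(X, k=3)

# Assumed selection rule: for each archetype, take the sentence that
# contributes most to it (largest entry in the corresponding row of B).
picks = B.argmax(axis=1).tolist()
print("reconstruction error:", errors[0], "->", errors[-1])
print("sentence picked per archetype:", picks)
```

In this sketch the rows of `A` can be read as a soft clustering of sentences over archetypes, while the rows of `B` indicate which original sentences make up each archetype; a real system would build `X` from the graph representation of the documents rather than from a hand-written matrix.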