Generating Natural Language Summaries from Multiple On-Line Sources: Language Reuse and Regeneration

Abstract

The abundance of news wire on the World-Wide Web has resulted in at least four major problems, which seem to present the most interesting challenges to users and researchers alike: size, heterogeneity, change, and conflicting information. Size: several hundred newspapers and news agencies maintain their Web sites with thousands of news stories in each. Heterogeneity: some of the data related to news is in structured format (e.g., tables); more exists in semi-structured format (e.g., Web pages, encyclopedias, textual databases); while the rest of the data is in textual form (e.g., newswire). Change: most Web sites and certainly all news sources change on a daily basis. Disagreement: different sources present conflicting or at least different views of the same event. We have approached the second, third, and fourth of these four problems from the point of view of text generation. We have developed a system, SUMMONS, which, when coupled with appropriate information extraction technology, generates a specific genre of natural language summaries of a particular event (which we call briefings) in a restricted domain. The briefings are concise, they contain facts from multiple and heterogeneous sources, and they incorporate evolving information, highlighting agreements and contradictions among sources on the same topic. We have developed novel techniques and algorithms for combining data from multiple sources at the conceptual level (using natural language understanding), for identifying new information on a given topic, and for presenting the information in natural language form to the user. We named the framework that we have developed for these problems language reuse and regeneration (LRR). Its novelty lies in the ability to produce text by collating together text already written by humans on the Web.
The main features of LRR are: increased robustness through a simplified parsing/generation component, leverage on text already written by humans, and facilities for the inclusion of structured data in computer-generated text. The present thesis contains an introduction to LRR and its use in multi-document summarization. We have paid special attention to the techniques for producing conceptual summaries of multiple sources, to the creation and use of an LRR-based lexicon for text generation, to a methodology used to identify new and old information in threads of documents, and to the generation of fluent natural language text using all the components above. The thesis contains evaluations of the different components of SUMMONS as well as certain aspects of LRR as a methodology. A review of the relevant literature is included as a separate chapter.

Bibliographic record

  • Author

    Radev, Dragomir R.

  • Author affiliation
  • Year 1999
  • Total pages
  • Original format PDF
  • Text language Albanian
  • CLC classification
