The importance of using domain knowledge in solving information distillation problems.

Abstract

This thesis is an inquiry into the importance of incorporating domain knowledge into emerging information distillation tasks, which are in principle similar to text summarization but in practice require techniques that previous work does not adequately address. The tasks analyzed are headline generation, biography creation, online discussion summarization, and automatic evaluation of summaries. This thesis shows empirically that traditional text summarization techniques, which are designed for generic summarization tasks, cannot be readily applied to these four tasks. Each task requires prior knowledge of the operating domain, data type, task structure, and output structure. Techniques and algorithms designed with this knowledge perform significantly better than those designed without it.

This thesis explores solutions to headline generation, that is, the generation of very short summaries. By identifying features that are specific to headlines, a keyword selection model was designed to select words that are headline-worthy. Context information surrounding these headline words is extracted to produce phrase-based headlines.

Typical question-answering systems target definition questions and produce factoid answers. However, when questions require complex answers, as "who is x" questions do, a biography creation engine is required. By categorizing a person's life into multiple classes of information, the engine becomes a classification engine that, coupled with extraction and re-ranking algorithms, produces biographies covering every aspect of a person's life.

The emergence of multi-party conversations recorded in text, such as online discussions, has prompted the development and analysis of summarization methods for this kind of input. Recognizing the speech aspect of this type of information, including modeling subtopic structures and the exchanges between multiple speakers, yields summaries of significantly better quality, whose construction is also in accordance with what human summary writers do.

Text summarization evaluation had previously been limited to manual annotation or comparison based on lexical identity. What separates manual from automatic matching is the ability to recognize paraphrases, and the lack of that ability makes automatic metrics extremely vulnerable. This thesis provides a solution that bridges the gap by using a large paraphrase collection acquired by applying statistical phrase-based machine translation (MT) algorithms to parallel data. This procedure produces a significantly higher correlation with human judgments and can serve as an objective function within a summarization system.
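To make the last point concrete, here is a minimal sketch of paraphrase-aware matching for summary evaluation. It is not the thesis's actual metric: the function names, the word-level (rather than phrase-level) matching, and the tiny hand-written paraphrase table are all assumptions made for illustration. The abstract describes a large collection acquired with phrase-based MT on parallel data, which this toy table only stands in for.

```python
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a simplification for this sketch)."""
    return text.lower().split()


def paraphrase_recall(reference: str, candidate: str,
                      paraphrases: Dict[str, Set[str]]) -> float:
    """Fraction of reference tokens found in the candidate,
    either exactly or through an entry in the paraphrase table."""
    ref_tokens = tokenize(reference)
    cand_tokens = set(tokenize(candidate))
    if not ref_tokens:
        return 0.0
    matched = 0
    for tok in ref_tokens:
        if tok in cand_tokens:
            matched += 1  # exact lexical match
        elif paraphrases.get(tok, set()) & cand_tokens:
            matched += 1  # match via a known paraphrase
    return matched / len(ref_tokens)


# Hypothetical toy paraphrase table, hand-written for this example only.
TOY_PARAPHRASES = {
    "purchased": {"bought", "acquired"},
    "firm": {"company"},
}

reference = "the firm purchased the startup"
candidate = "the company bought the startup"

exact_only = paraphrase_recall(reference, candidate, {})
with_paraphrases = paraphrase_recall(reference, candidate, TOY_PARAPHRASES)
print(exact_only, with_paraphrases)
```

With an empty table the score reflects lexical identity only, so the candidate is penalized for saying "company bought" instead of "firm purchased"; with the table, the same candidate is credited for those words. That difference is the gap between manual and automatic matching that the abstract refers to.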

Bibliographic record

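  • Author

    Zhou, Liang

  • Author affiliation

    University of Southern California

  • Degree-granting institution: University of Southern California
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2006
  • Pagination: 155 p.
  • Total pages: 155
  • Original format: PDF
  • Language of text: eng
  • Chinese Library Classification: Automation technology, computer technology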
