This thesis investigates the importance of incorporating domain knowledge into emerging information distillation tasks that are in principle similar to text summarization but in practice require techniques not adequately addressed in previous work. The tasks analyzed are headline generation, biography creation, online discussion summarization, and automatic evaluation of summaries. The thesis shows empirically that traditional text summarization techniques, which are designed for generic summarization tasks, cannot be readily applied to these four tasks. Each task requires prior knowledge of the operating domain, data type, task structure, and output structure. Techniques and algorithms designed with this knowledge perform significantly better than those designed without it.

This thesis explores solutions to headline generation, that is, the generation of very short summaries. A keyword selection model, built on features that are specific to headlines, selects words that are headline-worthy. Context information surrounding these headline words is then extracted to produce phrase-based headlines.

Typical question-answering systems target definition questions and produce factoid answers. However, when questions require complex answers, such as "who is x" questions, a biography creation engine is needed. By categorizing a person's life into multiple classes of information, the engine becomes a classification engine that, coupled with extraction and re-ranking algorithms, produces biographies covering every aspect of a person's life.

The emergence of multi-party conversations recorded in text, such as online discussions, has prompted the development and analysis of summarization methods for this kind of input.
Recognizing the speech-like nature of this type of data, including by modeling subtopic structures and the exchanges between multiple speakers, yields summaries of significantly better quality, whose construction also agrees with what human summary writers do.

Text summarization evaluation has previously been limited to manual annotation or comparison based on lexical identity. What separates manual from automatic matching is the ability to recognize paraphrases, and the lack of that ability makes automatic metrics extremely vulnerable. This thesis bridges the gap by using a large paraphrase collection acquired by applying statistical phrase-based machine translation (MT) algorithms to parallel data. This procedure produces a significantly higher correlation with human judgments and can serve as an objective function within a summarization system.