Keyword Extraction and Headline Generation Using Novel Word Features

Abstract

We introduce several novel word features for keyword extraction and headline generation. These features are derived from a document's background knowledge as supplied by Wikipedia. To acquire this background knowledge for a given document, we first generate a query from the key facts present in the document and use it to search the Wikipedia corpus for articles closely related to the document's contents. From each article in the search result set, we extract the inlink, outlink, category, and infobox information and derive a set of word features that reflect the document's background knowledge. These features indicate the importance of individual words in the input document and complement the traditional word features derivable from the document's explicit information. We also introduce a word-document fitness feature that characterizes the influence of a document's genre on keyword extraction and headline generation. Experiments on both tasks show encouraging results for these features.
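The feature-derivation step described above can be illustrated with a minimal sketch. The abstract does not give the actual feature definitions, so everything here is an assumption made for illustration: the function names `tokenize` and `background_features`, the dictionary layout of an article, and the scoring rule (the fraction of retrieved articles whose inlink, outlink, category, or infobox text contains a given document word). Query generation and the Wikipedia search itself are not shown; the articles are supplied as in-memory data.

```python
import re
from collections import Counter

# Hypothetical field names for the four kinds of article information
# mentioned in the abstract; the real data layout is not specified there.
FIELDS = ("inlinks", "outlinks", "categories", "infobox")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def background_features(document, articles):
    """For each word in the document, return one score per field.

    Assumed scoring rule (not taken from the paper): the fraction of
    retrieved articles whose field text contains the word.
    """
    doc_words = set(tokenize(document))
    counts = {field: Counter() for field in FIELDS}
    for article in articles:
        for field in FIELDS:
            # Collect the distinct words of this field so that each word
            # is counted at most once per article and field.
            field_words = set()
            for entry in article.get(field, []):
                field_words.update(tokenize(entry))
            counts[field].update(field_words & doc_words)
    total = max(len(articles), 1)  # avoid division by zero
    return {
        word: {field: counts[field][word] / total for field in FIELDS}
        for word in sorted(doc_words)
    }


if __name__ == "__main__":
    # Toy stand-in for a Wikipedia search result set.
    articles = [
        {
            "inlinks": ["Apple Inc."],
            "outlinks": ["iPhone", "Steve Jobs"],
            "categories": ["Apple Inc. hardware"],
            "infobox": ["Developer: Apple"],
        },
        {
            "inlinks": ["Smartphone"],
            "outlinks": ["iPhone"],
            "categories": ["Mobile phones"],
            "infobox": [],
        },
    ]
    features = background_features("Apple unveils new iPhone", articles)
    for word, scores in features.items():
        print(word, scores)
```

In this toy setup, words that recur in the link, category, or infobox text of the retrieved articles receive higher scores than words that appear only in the document itself, which is one simple way such features could signal a word's importance.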
