Conference: LREC-2012
Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization

Abstract

Fast and effective automated indexing is critical for search and personalized services. Key phrases, which consist of one or more words and represent the main concepts of a document, are often used for indexing. In this paper, we investigate the use of additional semantic features and pre-processing steps to improve automatic key phrase extraction. These features include signal words and Freebase categories. Some of these features lead to significant improvements in the accuracy of the results. We also experimented with two forms of document pre-processing that we call light filtering and co-reference normalization. Light filtering removes sentences that are judged peripheral to the document's main content. Co-reference normalization unifies several written forms of the same named entity into a single form. We also needed a "Gold Standard": a set of labeled documents for training and evaluation. While the subjective nature of key phrase selection precludes a true "Gold Standard", we used Amazon's Mechanical Turk service to obtain a useful approximation. Our data indicate that the biggest improvements in performance were due to shallow semantic features, news categories, and rhetorical signals (nDCG 78.47% vs. 68.93%). The inclusion of deeper semantic features such as Freebase sub-categories was not beneficial by itself but, in combination with pre-processing, did yield slight improvements in the nDCG scores.
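
The abstract reports results as nDCG (normalized discounted cumulative gain) but does not say how gains were assigned to phrases. The following is a minimal sketch of the standard nDCG formula only. It assumes, purely for illustration, that each extracted phrase has a numeric gain (for example, how many annotators selected it) and that the ideal ranking is a descending sort of those same gains. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: the gain at 1-based rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains: Sequence[float]) -> float:
    """DCG of the ranking as given, divided by the DCG of the same gains sorted descending.

    Returns 0.0 when every gain is zero, so the ratio is always defined.
    """
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


# Hypothetical gains for four phrases, in the order a system ranked them.
system_ranking = [3, 0, 2, 1]
score = ndcg(system_ranking)
```

A ranking that already lists phrases in descending order of gain scores 1.0 under this definition. Placing a low-gain phrase near the top lowers the score, because gains at later ranks are discounted more heavily.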