Journal: Web Intelligence and Agent Systems

DIKEA: Exploiting Wikipedia for keyphrase extraction

Abstract

Automatic keyphrase extraction is the challenging task of assigning keyphrases to documents so that they capture the main topics. It supports many areas of text mining, including indexing, clustering, and summarisation. A landmark study, KEA (Keyphrase Extraction Algorithm), formulated the task as a supervised machine learning problem and successfully applied a Naive Bayes model to it. KEA showed great promise, but its performance is not satisfactory. Its state-of-the-art extension, KEA++, significantly improved performance but relies on a domain-specific vocabulary, which is often unavailable or incomplete for other domains. We present a novel domain-independent system (DIKEA) that makes three main contributions to this field of research: utilising the largest online knowledge source available, Wikipedia, for keyphrase candidate selection; adding new features, including a Wikipedia-based feature, link probability; and further boosting performance by using a multilayer perceptron network. Our experiments showed that DIKEA outperformed KEA++ while keeping the overall solution domain-independent. DIKEA was also tested on a benchmark dataset provided by a workshop on Semantic Evaluation (SemEval-2010), allowing comparison with the 19 other related systems that participated. On this benchmark, DIKEA ranks first when only the top 5 keyphrases extracted from each document are considered, and second overall.
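
The abstract names link probability as a Wikipedia-based feature but does not define it. The sketch below assumes one common definition from the Wikipedia-mining literature: among the articles that mention a phrase, the fraction in which the phrase appears as link anchor text. The `Article` type, `TOY_CORPUS`, and `link_probability` are hypothetical names for illustration, not DIKEA's implementation, and the substring matching is a simplification.

```python
from typing import Iterable, NamedTuple


class Article(NamedTuple):
    """Hypothetical stand-in for a parsed Wikipedia article."""
    text: str            # plain text of the article
    anchors: frozenset   # anchor texts of the links the article contains


def link_probability(phrase: str, articles: Iterable[Article]) -> float:
    """Fraction of articles mentioning `phrase` that also use it as a link anchor.

    Assumed definition (not taken from the abstract):
        articles where the phrase is an anchor / articles where it occurs at all
    """
    phrase = phrase.lower()
    containing = 0  # articles whose text mentions the phrase
    linking = 0     # of those, articles that use the phrase as anchor text
    for article in articles:
        if phrase in article.text.lower():
            containing += 1
            if phrase in {a.lower() for a in article.anchors}:
                linking += 1
    # A phrase that never occurs gets probability 0 rather than dividing by zero.
    return linking / containing if containing else 0.0


# Toy data, invented purely for illustration.
TOY_CORPUS = [
    Article("Machine learning is used for keyphrase extraction.",
            frozenset({"machine learning", "keyphrase extraction"})),
    Article("Early machine learning systems relied on naive Bayes.",
            frozenset({"naive Bayes"})),
    Article("A perceptron is a machine learning model.",
            frozenset({"machine learning", "perceptron"})),
]

if __name__ == "__main__":
    # "machine learning" occurs in three articles and is linked in two of them.
    print(round(link_probability("machine learning", TOY_CORPUS), 3))  # 0.667
```

Under this assumed definition, a candidate phrase that Wikipedia editors frequently choose to link scores close to 1, which is why such a value could serve as one input feature to a classifier such as the multilayer perceptron mentioned in the abstract.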
