Journal: Journal of Ambient Intelligence and Humanized Computing

DegExt: a language-independent keyphrase extractor

Abstract

In this paper, we introduce DegExt, a graph-based, language-independent keyphrase extractor that extends the keyword extraction method described in Litvak and Last (Graph-based keyword extraction for single-document summarization. In: Proceedings of the workshop on multi-source multilingual information extraction and summarization, pp 17–24, 2008). We compare DegExt with two state-of-the-art approaches to keyphrase extraction: GenEx (Turney in Inf Retr 2:303–336, 2000) and TextRank (Mihalcea and Tarau in TextRank: bringing order into texts. In: Proceedings of the conference on empirical methods in natural language processing. Barcelona, Spain, 2004). We evaluated DegExt on collections of benchmark summaries in two languages: English and Hebrew. Our experiments on the English corpus show that, for summaries of 15 keyphrases or more, DegExt significantly outperforms TextRank and GenEx in terms of precision and area under curve, at the expense of a mostly non-significant decrease in recall and F-measure, when the extracted phrases are matched against the gold standard collection. Because DegExt tends to extract longer phrases than GenEx and TextRank, it outperforms both in terms of recall and F-measure when the single extracted words are considered instead. On the Hebrew corpus, DegExt performs the same as TextRank regardless of the number of keyphrases. An additional experiment shows that DegExt applied to the TextRank representation graphs outperforms the other systems in the text classification task. For documents in both languages, DegExt surpasses both GenEx and TextRank in terms of implementation simplicity and computational complexity.
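
The difference between phrase-level and word-level matching can be illustrated with a minimal sketch. It is not the paper's evaluation code: the helper names `prf` and `words`, the exact-match criterion, the whitespace tokenization, and the sample data are all assumptions made for illustration. It only shows why an extractor that returns multi-word phrases may score lower when whole phrases are matched against a gold standard and higher when single words are compared.

```python
def prf(extracted, gold):
    """Return (precision, recall, F-measure) for two collections of items."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure


def words(phrases):
    """Split phrases into a set of lowercase words (naive whitespace split)."""
    return {w for p in phrases for w in p.lower().split()}


# Hypothetical data, invented for this illustration only.
gold = ["keyphrase extraction", "graph", "summarization"]
extracted = ["graph-based keyphrase extraction", "summarization"]

# Phrase level: an extracted phrase counts only if it matches a gold phrase exactly.
phrase_scores = prf([p.lower() for p in extracted], [g.lower() for g in gold])

# Word level: each phrase is broken into words before matching.
word_scores = prf(words(extracted), words(gold))

print("phrase-level P/R/F:", phrase_scores)
print("word-level   P/R/F:", word_scores)
```

In this toy case the longer extracted phrase earns no credit at the phrase level, but most of its words are credited at the word level, so the word-level scores are higher.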