International Journal of Computational Linguistics and Applications

Thematically Reinforced Explicit Semantic Analysis

Abstract

We present an extended, thematically reinforced version of Gabrilovich and Markovitch's Explicit Semantic Analysis (ESA), in which thematic information is obtained from the category structure of Wikipedia. To this end, we first define a notion of categorical tfidf, which measures the relevance of terms in categories. Using this measure as a weight, we calculate a maximal spanning tree of the Wikipedia corpus, considered as a directed graph of pages and categories. This tree provides a unique path of "most related categories" between each page and the top of the hierarchy. We reinforce the tfidf of words in a page by aggregating it with the categorical tfidfs of the nodes on these paths, and define a thematically reinforced ESA semantic relatedness measure that is more robust than standard ESA and less sensitive to noise caused by out-of-context words. We apply our method to the French Wikipedia corpus, evaluate it through a text classification task on a 37.5 MB corpus of 20 French newsgroups, and obtain a precision increase of 9-10% compared with standard ESA.
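
The pipeline described in the abstract (weighted page/category graph, spanning tree, path to the top of the hierarchy, reinforced tfidf) can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the toy graph and its weights, the hypothetical names `PARENTS`, `CAT_TFIDF`, `best_parent_tree`, `path_to_root` and `reinforced_tfidf`, the geometric `decay` used to aggregate scores, and the simplification that the graph is acyclic, so that keeping each node's heaviest edge already yields a tree. It is not the authors' implementation.

```python
# Hypothetical toy data: child -> {parent category: edge weight}.
# The weights stand in for a relevance score based on categorical tfidf.
PARENTS = {
    "page:Jaguar": {"cat:Felines": 0.9, "cat:Car brands": 0.4},
    "cat:Felines": {"cat:Animals": 0.8},
    "cat:Car brands": {"cat:Industry": 0.7},
    "cat:Animals": {"cat:Root": 1.0},
    "cat:Industry": {"cat:Root": 1.0},
}

# Hypothetical categorical tfidf of each term in each category.
CAT_TFIDF = {
    "cat:Felines": {"cat": 0.6},
    "cat:Animals": {"cat": 0.3},
    "cat:Car brands": {"engine": 0.7},
    "cat:Industry": {"engine": 0.4},
    "cat:Root": {},
}


def best_parent_tree(parents):
    """Keep only the heaviest edge leaving each node.

    Assumption: the input graph is acyclic, so this greedy choice
    produces a tree. A graph with cycles would need a proper
    maximum spanning arborescence algorithm instead.
    """
    return {node: max(edges, key=edges.get) for node, edges in parents.items()}


def path_to_root(node, tree):
    """Follow the single kept edge from a node up to the top category."""
    path = []
    while node in tree:
        node = tree[node]
        path.append(node)
    return path


def reinforced_tfidf(term, page, page_tfidf, tree, cat_tfidf, decay=0.5):
    """Page tfidf plus decayed categorical tfidf along the page's path.

    The geometric decay is an assumed aggregation rule, chosen only
    to make the idea concrete.
    """
    score = page_tfidf.get(term, 0.0)
    for depth, category in enumerate(path_to_root(page, tree), start=1):
        score += (decay ** depth) * cat_tfidf.get(category, {}).get(term, 0.0)
    return score


tree = best_parent_tree(PARENTS)
page_scores = {"cat": 0.2, "engine": 0.2}  # hypothetical page-level tfidf
on_theme = reinforced_tfidf("cat", "page:Jaguar", page_scores, tree, CAT_TFIDF)
off_theme = reinforced_tfidf("engine", "page:Jaguar", page_scores, tree, CAT_TFIDF)
print(path_to_root("page:Jaguar", tree), on_theme, off_theme)
```

In this toy setup both terms start with the same page-level tfidf, but only the term that matches the categories on the page's path is boosted, which is the intuition behind reduced sensitivity to out-of-context words.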
