Journal: Journal of computer sciences

New Information Content Glossary Relatedness (ICGR) Approach for Short Text Similarity (STS) Tasks


Abstract

Measuring the semantic relatedness of words with complementary Wikipedia-based and WordNet-based methods takes two forms, combined and integrative, both of which aim to increase the semantic space between related words. However, each form has its own issues regarding its components and the strategy used to combine or integrate corpus-based and knowledge-based methods. In the integrative strategy, a large corpus such as Wikipedia is used to extract a set of related words for a particular concept as a basis for searching the WordNet space. The drawback of this strategy is its use of a fixed scaling parameter, which only fits an implemented dataset that is near to the human score. Other corpus-based methods use an experimentally determined cut-off threshold to reduce the semantic space and to steer the search toward a more accurate semantic space. Such methods consider only the frequency of bigrams and ignore the frequency of individual terms. Knowledge-based methods that use gloss overlap share a similar limitation: they lose many valuable relatedness features that would yield a more accurate measurement. This paper therefore proposes a new Information Content Glossary Relatedness (ICGR) approach in two steps. First, an Extended-PMI based on a cut-off density threshold extracts a Robust Relatedness Vector set (RVS) from a large Wikipedia dataset. Second, a Semantic Structural Information (SSI) method uses the RVS as a fulcrum to define the most related gloss in WordNet for each gloss and to select the top 5 glosses related to each RVS. The results showed that the proposed approach outperformed state-of-the-art methods: the Extended-PMI achieved a Spearman's correlation of 0.89 with the human score, and the ICGR approach achieved a Spearman's correlation of 0.8 with the human score.
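The abstract does not give the Extended-PMI formula or say how the cut-off density threshold is chosen, so the following is only a sketch of the standard building blocks it names, not the authors' method. It computes ordinary pointwise mutual information (PMI) from sentence-level co-occurrence counts, which uses the frequency of individual terms as well as of pairs, keeps only the words whose PMI exceeds a fixed cut-off, and evaluates scores against human ratings with Spearman's rank correlation. The function names, the `threshold` parameter, and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_related_words(sentences, target, threshold=0.0):
    """Return {word: PMI(target, word)} for words whose PMI exceeds threshold.

    Standard PMI over sentence-level co-occurrence; NOT the paper's
    Extended-PMI, whose definition is not given in the abstract.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        unique = sorted(set(tokens))  # count each word once per sentence
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    n = len(sentences)
    scores = {}
    for (a, b), c_ab in pair_counts.items():
        if target not in (a, b):
            continue
        other = b if a == target else a
        p_ab = c_ab / n
        p_a, p_b = word_counts[a] / n, word_counts[b] / n
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi > threshold:  # hypothetical fixed cut-off
            scores[other] = pmi
    return scores


def spearman(xs, ys):
    """Spearman's rank correlation, using average ranks for ties.

    Assumes both inputs have the same length and non-constant values.
    """
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank of the tied run
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy corpus (made up for illustration only).
corpus = [
    ["car", "engine", "road"],
    ["car", "engine"],
    ["car", "road"],
    ["banana", "fruit"],
    ["banana", "fruit", "road"],
]
related = pmi_related_words(corpus, "car", threshold=0.5)
```

With this toy corpus, "engine" passes the cut-off for "car" while "road" does not, because "road" is frequent on its own and its individual frequency lowers the PMI. In a real evaluation, `spearman` would be called with the method's scores and the human ratings for the same word pairs.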
