Journal: Data & Knowledge Engineering

A semantic similarity metric combining features and intrinsic information content


Abstract

In many research fields, such as Psychology, Linguistics, Cognitive Science and Artificial Intelligence, computing semantic similarity between words is an important issue. This paper presents a new semantic similarity metric that takes some notions from the feature-based theory of similarity and translates them into the information-theoretic domain, which leverages the notion of Information Content (IC). In particular, the proposed metric exploits the notion of intrinsic IC, which quantifies IC values by examining how concepts are arranged in an ontological structure. To evaluate the metric, an online experiment was conducted in which the research community was asked to rank a list of 65 word pairs. The experiment's web setup made it possible to collect 101 similarity ratings and to distinguish native from non-native English speakers. Such a large and diverse dataset makes it possible to evaluate similarity metrics with confidence by correlating them with human assessments. Experimental evaluations using WordNet indicate that the proposed metric, coupled with the notion of intrinsic IC, yields results above the state of the art. Moreover, the intrinsic IC formulation also improves the accuracy of other IC-based metrics. To investigate the generality of both the intrinsic IC formulation and the proposed similarity metric, a further evaluation was performed using the MeSH biomedical ontology. Significant results were obtained in this case as well. The proposed metric and several others have been implemented in the Java WordNet Similarity Library.
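The abstract does not give formulas, so the following is only a minimal sketch of the idea of deriving IC from the structure of a taxonomy. It assumes one commonly used intrinsic IC formulation, 1 - log(hypo(c) + 1) / log(N), where hypo(c) is the number of concepts below c and N is the total number of concepts. It also assumes an illustrative feature-inspired combination, 3 * IC(msca) - IC(c1) - IC(c2), where msca is the most specific common ancestor. The toy taxonomy and all function names are hypothetical, and the code should not be read as the paper's definition or as the API of the Java WordNet Similarity Library.

```python
import math

# Hypothetical toy taxonomy (child -> parent); not taken from WordNet or MeSH.
PARENT = {
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}
CONCEPTS = set(PARENT) | set(PARENT.values())


def ancestors(concept):
    """Return the concept followed by its ancestors up to the root."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def hyponym_count(concept):
    """Count the concepts located strictly below `concept`."""
    return sum(1 for c in CONCEPTS if c != concept and concept in ancestors(c))


def intrinsic_ic(concept):
    """Assumed formulation: 1 - log(hypo(c) + 1) / log(total concepts)."""
    return 1.0 - math.log(hyponym_count(concept) + 1) / math.log(len(CONCEPTS))


def msca(c1, c2):
    """Most specific common ancestor: the shared ancestor with the highest IC."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(common, key=intrinsic_ic)


def similarity(c1, c2):
    """Assumed combination: shared IC counted three times minus each concept's IC."""
    if c1 == c2:
        return 1.0
    return 3 * intrinsic_ic(msca(c1, c2)) - intrinsic_ic(c1) - intrinsic_ic(c2)


if __name__ == "__main__":
    for pair in [("dog", "cat"), ("dog", "car"), ("dog", "dog")]:
        print(pair, round(similarity(*pair), 3))
```

In this toy setting the root carries no information (IC of 0) and leaves carry the maximum (IC of 1), so "dog" and "cat", which share the more specific ancestor "animal", score higher than "dog" and "car", which share only the root. Under this assumed combination the scores for distinct concepts are not confined to the range 0 to 1.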
