Journal: Knowledge and Data Engineering, IEEE Transactions on

Unsupervised Semantic Similarity Computation between Terms Using Web Documents

Abstract

In this work, Web-based metrics that compute the semantic similarity between words or terms are presented and compared with the state of the art. Starting from the fundamental assumption that similarity of context implies similarity of meaning, relevant Web documents are downloaded via a Web search engine and the contextual information of words of interest is compared (context-based similarity metrics). The proposed algorithms work automatically, do not require any human-annotated knowledge resources, e.g., ontologies, and can be generalized and applied to different languages. Context-based metrics are evaluated both on the Charles-Miller data set and on a medical term data set. It is shown that context-based similarity metrics significantly outperform co-occurrence-based metrics, in terms of correlation with human judgment, for both tasks. In addition, the proposed unsupervised context-based similarity computation algorithms are shown to be competitive with the state-of-the-art supervised semantic similarity algorithms that employ language-specific knowledge resources. Specifically, context-based metrics achieve correlation scores of up to 0.88 and 0.74 for the Charles-Miller and medical data sets, respectively. The effect of stop word filtering is also investigated for word and term similarity computation. Finally, the performance of context-based term similarity metrics is evaluated as a function of the number of Web documents used and for various feature weighting schemes.
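
The core idea above, that similarity of context implies similarity of meaning, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed context window, the raw-frequency weighting, and the tiny in-memory corpus (standing in for documents retrieved through a Web search engine) are all assumptions made for illustration.

```python
import math
from collections import Counter


def context_vector(word, documents, window=1):
    """Count tokens found within `window` positions of each occurrence of `word`.

    `documents` is a list of plain-text strings; in the setting described by the
    abstract these would be downloaded Web pages (assumption: whitespace tokens).
    """
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok == word:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                # Left and right neighbours, excluding the word itself.
                counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_similarity(w1, w2, documents, window=1):
    """Compare two words by the contexts they appear in."""
    return cosine(
        context_vector(w1, documents, window),
        context_vector(w2, documents, window),
    )


# Hypothetical toy corpus, used only to exercise the functions.
docs = [
    "the car drove down the road",
    "the automobile drove down the street",
    "she ate the banana slowly",
]

print(context_similarity("car", "automobile", docs))
print(context_similarity("car", "banana", docs))
```

In this toy corpus "car" and "automobile" share the same neighbouring words, so they score higher than "car" and "banana". Stop word filtering and alternative feature weighting schemes, both of which the abstract says were investigated, would change how `context_vector` builds its counts.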