首页> 中文期刊>中文信息学报 >一种基于搭配的中文词汇语义相似度计算方法

一种基于搭配的中文词汇语义相似度计算方法

     

摘要

The word similarity measure plays a basic role in many NLP related applications. In this paper, we propose a novel and practical method for this purpose with acceptable precision. Guided by the classic distribution hypothesis that "similar words occur in similar contexts", we suggest the collocations in two-word noun phrases can serve as better contexts than the adjacent words because the former are more semantic related. By using automatic built large-scale noun phrases, we firstly construct tf-idf weighted words vectors containing direct and indirect collocations, and then take their cosine distances as desired semantic similarities. In order to compare with related approaches, we manually design a benchmark test set. On the benchmark test set, the proposed method achieves the correlation coefficients of 0. 703, 0. 509, and 0. 700 on nouns, verbs, and adjectives, respectively, at a coverage 100%.%词汇间的语义相似度计算在自然语言处理相关的许多应用中有基础作用.该文提出了一种新的计算方法,具有高效实用、准确率较高的特点.该方法从传统的分布相似度假设“相似的词汇出现在相似的上下文中”出发,提出不再采用词汇在句子中的邻接词,而是采用词汇在二词名词短语中的搭配词作为其上下文,将更能体现词汇的语义特征,可取得更好的计算结果.在自动构建大规模二词名词短语的基础上,首先基于tf-idf构造直接和间接搭配词向量,然后通过计算搭配词向量间的余弦距离得到词汇间的语义相似度.为了便于与相关方法比较,构建了基于人工评分的中文词汇语义相似度基准测试集,在该测试集中的名、动、形容词中,方法分别得到了0.703、0.509、0.700的相关系数,及100%的覆盖率.

著录项

相似文献

  • 中文文献
  • 外文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号