Journal: Computational Linguistics

Co-occurrence retrieval: A flexible framework for lexical distributional similarity

Abstract

Techniques that exploit knowledge of distributional similarity between words have been proposed in many areas of Natural Language Processing. For example, in language modeling, the sparse data problem can be alleviated by estimating the probabilities of unseen co-occurrences of events from the probabilities of seen co-occurrences of similar events. In other applications, distributional similarity is taken to be an approximation to semantic similarity. However, due to the wide range of potential applications and the lack of a strict definition of the concept of distributional similarity, many methods of calculating distributional similarity have been proposed or adopted. In this work, a flexible, parameterized framework for calculating distributional similarity is proposed. Within this framework, the problem of finding distributionally similar words is cast as one of co-occurrence retrieval (CR) for which precision and recall can be measured by analogy with the way they are measured in document retrieval. As will be shown, a number of popular existing measures of distributional similarity are simulated with parameter settings within the CR framework. In this article, the CR framework is then used to systematically investigate three fundamental questions concerning distributional similarity. First, is the relationship of lexical similarity necessarily symmetric, or are there advantages to be gained from considering it as an asymmetric relationship? Second, are some co-occurrences inherently more salient than others in the calculation of distributional similarity? Third, is it necessary to consider the difference in the extent to which each word occurs in each co-occurrence type? Two application-based tasks are used for evaluation: automatic thesaurus generation and pseudo-disambiguation. 
It is possible to achieve significantly better results on both these tasks by varying the parameters within the CR framework rather than using other existing distributional similarity measures; it will also be shown that any single unparameterized measure is unlikely to be able to do better on both tasks. This is due to an inherent asymmetry in lexical substitutability and therefore also in lexical distributional similarity.
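
The retrieval analogy in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible reading: each word is represented by the set of co-occurrence types it has been seen with, and one word's set is scored against another's as if it were a retrieved document set. The function names, the toy feature sets, and the weighting parameter `beta` are hypothetical and chosen for illustration only; they are not taken from the article, whose actual parameterization and weighting schemes may differ.

```python
def cr_precision_recall(retrieved, target):
    """Set-based precision and recall of one word's co-occurrence types
    (retrieved) measured against another word's (target).
    Illustrative assumption: every co-occurrence type counts equally."""
    shared = retrieved & target
    precision = len(shared) / len(retrieved) if retrieved else 0.0
    recall = len(shared) / len(target) if target else 0.0
    return precision, recall


def combine(precision, recall, beta=0.5):
    """Weighted arithmetic mean of precision and recall.
    beta is a hypothetical parameter: 1.0 uses precision only,
    0.0 uses recall only, 0.5 weights both equally (symmetric)."""
    return beta * precision + (1.0 - beta) * recall


# Toy, invented co-occurrence sets (e.g. verbs seen with each noun).
features = {
    "dog": {"bark", "walk", "feed", "pet"},
    "animal": {"feed", "pet", "walk", "hunt", "protect", "study"},
}

p_da, r_da = cr_precision_recall(features["dog"], features["animal"])
p_ad, r_ad = cr_precision_recall(features["animal"], features["dog"])

print("dog vs animal:", p_da, r_da)
print("animal vs dog:", p_ad, r_ad)
print("precision-only scores:", combine(p_da, r_da, 1.0), combine(p_ad, r_ad, 1.0))
print("equal-weight scores:", combine(p_da, r_da, 0.5), combine(p_ad, r_ad, 0.5))
```

In this toy setting, swapping the two words swaps precision and recall, so a precision-only score is asymmetric while an equal weighting is symmetric. This is one way to picture the first research question: whether similarity should be treated as a symmetric or an asymmetric relationship.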
