
Evaluating distributional models of compositional semantics


Abstract

Distributional models (DMs) are a family of unsupervised algorithms that represent the meaning of words as vectors. They have been shown to capture interesting aspects of semantics. Recent work has sought to compose word vectors in order to model phrases and sentences. The most commonly used measure of a compositional DM's performance to date has been the degree to which it agrees with human-provided phrase similarity scores.

The contributions of this thesis are three-fold. First, I argue that existing intrinsic evaluations are unreliable, as they make use of small and subjective gold-standard data sets and assume a notion of similarity that is independent of any particular application. Therefore, they do not necessarily measure how well a model performs in practice. I study four commonly used intrinsic datasets and demonstrate that all of them exhibit undesirable properties.

Second, I propose a novel framework within which to compare word- or phrase-level DMs in terms of their ability to support document classification. My approach couples a classifier to a DM and provides a setting where classification performance is sensitive to the quality of the DM.

Third, I present an empirical evaluation of several methods for building word representations and composing them within my framework. I find that the determining factor in building word representations is data quality rather than quantity; in some cases only a small amount of unlabelled data is required to reach peak performance. Neural algorithms for building single-word representations perform better than counting-based ones regardless of which composition method is used, but simple composition algorithms can outperform more sophisticated competitors. Finally, I introduce a new algorithm for improving the quality of distributional thesauri using information from repeated runs of the same non-deterministic algorithm.
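The abstract contrasts "simple composition algorithms" with more sophisticated competitors. A minimal sketch of the simplest such method, pointwise-additive composition, together with the cosine comparison typically used to score phrase similarity (the word vectors below are made-up toy values, not from any trained model):

```python
import numpy as np

# Toy 3-dimensional word vectors (hypothetical values for illustration;
# a real DM would learn these from corpus co-occurrence counts or a
# neural objective, typically in hundreds of dimensions).
vectors = {
    "red":  np.array([0.9, 0.1, 0.3]),
    "car":  np.array([0.2, 0.8, 0.5]),
    "fast": np.array([0.8, 0.2, 0.4]),
}

def compose_add(words, vecs):
    """Additive composition: the phrase vector is the sum of its word vectors."""
    return np.sum([vecs[w] for w in words], axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

red_car = compose_add(["red", "car"], vectors)
fast_car = compose_add(["fast", "car"], vectors)
print(cosine(red_car, fast_car))
```

Pointwise multiplication (`np.prod` over the word vectors) is the other classic simple composition function; both ignore word order, which is one reason intrinsic phrase-similarity scores alone can be a misleading measure of quality.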

Bibliographic details

  • Author

    Batchkarov Miroslav Manov

  • Affiliation
  • Year: 2016
  • Total pages
  • Format: PDF
  • Language: en
  • CLC classification
