Conference: Workshop on Cognitive Aspects of the Lexicon
Modeling Word Meaning: Distributional Semantics and the Corpus Quality-Quantity Trade-Off

Abstract

Dictionaries constructed using distributional models of lexical semantics have a wide range of applications in NLP and in the modeling of linguistic cognition. However, when constructing such a model, we are faced with a range of corpora to choose from. Often there is a choice between small, carefully constructed corpora of well-edited text and very large heterogeneous collections harvested automatically from the web. There may also be differences in the distribution of genres and registers in such corpora. In this paper we examine these trade-offs by constructing a simple SVD-reduced word-collocate model, using four English corpora: the Google Web 5-gram collection, the Google Book 5-gram collection, the English Wikipedia, and a collection of short social messages harvested from Twitter. Since these models need to encode semantics in a way that approximates the mental lexicon, we evaluate the felicity of the resulting semantic representations using a set of behavioral and neural-activity benchmarks that depend on word similarity. We find that the quality of the input text has a very strong effect on the performance of the output model, and that a small corpus of high quality can outperform a corpus of poor quality that is many orders of magnitude larger. We also explore the semantic closeness of the models, using their mutual information overlap to interpret the similarity of the corpus texts.
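
As a rough illustration of what an SVD-reduced word-collocate model can look like, the sketch below counts collocates in a symmetric window, reweights the counts, reduces the matrix with a truncated SVD, and compares words by cosine similarity. The abstract does not state the window size, the weighting scheme, or the number of dimensions, so the window of 2, the positive PMI weighting, `k=3`, the toy sentences, and all function names here are assumptions made for illustration, not the authors' settings.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Count word-collocate pairs within a symmetric window (window size is an assumption)."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1
    return vocab, counts

def ppmi(counts):
    """Positive pointwise mutual information (an assumed weighting, not taken from the abstract)."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts give -inf or nan; treat as no association
    return np.maximum(pmi, 0.0)

def svd_reduce(matrix, k):
    """Keep the top-k singular dimensions as dense word vectors."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical toy corpus, far too small to yield meaningful similarities.
sentences = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "dog", "chased", "the", "cat"],
    ["stocks", "fell", "on", "the", "market"],
]
vocab, counts = cooccurrence_matrix(sentences)
vectors = svd_reduce(ppmi(counts), k=3)
similarity = cosine(vectors[vocab.index("cat")], vectors[vocab.index("dog")])
```

In an evaluation of the kind the abstract describes, similarity scores like `similarity` would be compared against behavioral or neural-activity benchmarks; that comparison is not shown here.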
