Modeling Word Meaning: Distributional Semantics and the Corpus Quality-Quantity Trade-Off

Abstract

Dictionaries constructed using distributional models of lexical semantics have a wide range of applications in NLP and in the modeling of linguistic cognition. However, when constructing such a model, we are faced with a range of corpora to choose from. Often there is a choice between small, carefully constructed corpora of well-edited text and very large heterogeneous collections harvested automatically from the web. There may also be differences in the distribution of genres and registers in such corpora. In this paper we examine these trade-offs by constructing a simple SVD-reduced word-collocate model, using four English corpora: the Google Web 5-gram collection, the Google Book 5-gram collection, the English Wikipedia, and a collection of short social messages harvested from Twitter. Since these models need to encode semantics in a way that approximates the mental lexicon, we evaluate the felicity of the resulting semantic representations using a set of behavioral and neural-activity benchmarks that depend on word similarity. We find that the quality of the input text has a very strong effect on the performance of the output model, and that a corpus of high quality at a small size can outperform a corpus of poor quality that is many orders of magnitude larger. We also explore the semantic closeness of the models using their mutual information overlap to interpret the similarity of corpus texts.
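
As a rough illustration of what an SVD-reduced word-collocate model can look like, the Python sketch below counts collocates in a symmetric window, reweights the counts, reduces the matrix with a truncated SVD, and compares words by cosine similarity. The toy sentences, the window size, the positive PMI weighting, the number of dimensions, and all function names are assumptions made for this example; the abstract does not specify them, and this is not the authors' implementation.

```python
import numpy as np

def build_cooccurrence(sentences, window=2):
    """Count how often each word appears within `window` tokens of another."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1
    return vocab, counts

def ppmi(counts):
    """Positive pointwise mutual information (an assumed weighting scheme)."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # unseen pairs get a score of zero
    return np.maximum(pmi, 0.0)

def svd_reduce(matrix, k):
    """Keep the top-k singular dimensions as dense word vectors."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical toy corpus standing in for a real collection.
sentences = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "stock", "market", "fell"],
    ["the", "market", "rose"],
]

vocab, counts = build_cooccurrence(sentences, window=2)
vectors = svd_reduce(ppmi(counts), k=2)
word_index = {w: i for i, w in enumerate(vocab)}

for other in ("dog", "market"):
    sim = cosine(vectors[word_index["cat"]], vectors[word_index[other]])
    print(f"cat vs {other}: {sim:.3f}")
```

Word-similarity benchmarks of the kind the abstract mentions would then compare such cosine scores against human judgments or neural-activity data; that evaluation step is not shown here.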

Bibliographic record

Source
Workshop on cognitive aspects of the lexicon | 2012 | 16 pages
Authors
Seshadri Sridharan; Brian Murphy
Original format: PDF
Chinese Library Classification: Programming languages, algorithmic languages
Keywords
Vector space models; distributional semantics; corpus size; corpus genre; corpus quality; neurosemantics; word similarity
Date added: 2022-08-20 22:02:28

Similar literature

1. When Meaning Is Not Enough: Distributional and Semantic Cues to Word Categorization in Child Directed Speech [J]. Sara Feijoo, Carmen Muñoz, Anna Amadó. Frontiers in Psychology. 2017, Issue 1
2. Semantic Memory Search and Retrieval in a Novel Cooperative Word Game: A Comparison of Associative and Distributional Semantic Models [J]. Kumar Abhilasha A., Steyvers Mark, Balota David A. Cognitive Science. 2021, Issue 10
3. Corpus domain effects on distributional semantic modeling of medical terms [J]. Bioinformatics. 2016, Issue 23
4. Modeling Word Meaning: Distributional Semantics and the Corpus Quality-Quantity Trade-Off [C]. Seshadri Sridharan, Brian Murphy. Workshop on cognitive aspects of the lexicon; International conference on computational linguistics. 2012
5. A machine-aided approach to intelligent index generation: Using natural language processing and latent semantic analysis to determine the contexts and relationships among words in a corpus [D]. Lukon, Shelly Candita. 2006
6. When Meaning Is Not Enough: Distributional and Semantic Cues to Word Categorization in Child Directed Speech [O]. Sara Feijoo, Carmen Muñoz, Anna Amadó
7. Utilizing Semantic Composition in Distributional Semantic Models for Word Sense Discrimination and Word Sense Disambiguation [O]. Cem Akkaya, Janyce Wiebe, Rada Mihalcea. 2013
