Word Sense Disambiguation Using Cosine Similarity Collaborates with Word2vec and WordNet

Abstract

Words have different meanings (i.e., senses) depending on the context. Disambiguating the correct sense is an important and challenging task for natural language processing. An intuitive approach is to select the sense whose definition has the highest similarity to the context, using the sense definitions provided by WordNet, a large lexical database of English. In this database, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms interlinked through conceptual-semantic and lexical relations. Traditional unsupervised approaches compute similarity by counting words that overlap between the context and the sense definitions, which must match exactly. Similarity should instead be computed from how words are related rather than from exact overlap, by representing the context and the sense definitions in a vector space model and analyzing the distributional semantic relationships among them using latent semantic analysis (LSA). However, as a corpus of text grows, LSA consumes much more memory and does not scale well to training on a huge corpus. Word-embedding approaches avoid this problem. Word2vec is a popular word-embedding approach that represents words in a fixed-size vector space using either the skip-gram or the continuous bag-of-words (CBOW) model, and it captures semantic and syntactic word similarities from a huge corpus more effectively than LSA. Our method uses Word2vec to construct a context sentence vector and sense definition vectors, then scores each word sense by the cosine similarity between those sentence vectors. Each sense definition is also expanded with sense relations retrieved from WordNet. If a score does not exceed a specific threshold, it is combined with the probability of that sense estimated from a large sense-tagged corpus, SEMCOR. The senses with the highest scores are taken as the possible answers. Our results (50.9%, or 48.7% without the sense distribution probability) are higher than the baselines (the original, simplified, adapted, and LSA variants of Lesk) and outperform many unsupervised systems that participated in the SENSEVAL-3 English lexical sample task.
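
To make the scoring scheme concrete, the following is a minimal Python sketch, assuming a pre-trained word2vec model loaded with gensim and NLTK's WordNet interface. The helper names (sentence_vector, expanded_definition, score_senses), the expansion via hypernyms and hyponyms, the threshold value, and the weighted combination with the sense prior are illustrative assumptions, not the paper's exact implementation.

import numpy as np
from nltk.corpus import wordnet as wn
from gensim.models import KeyedVectors

# Assumed: any pre-trained model in word2vec format (the path is a placeholder).
wv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def sentence_vector(tokens):
    """Average the word2vec vectors of the in-vocabulary tokens."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def expanded_definition(synset):
    """Sense definition expanded with definitions of related synsets
    (hypernyms and hyponyms here, one plausible choice of relations)."""
    text = synset.definition()
    for related in synset.hypernyms() + synset.hyponyms():
        text += " " + related.definition()
    return text.lower().split()

def score_senses(context_tokens, target_word, threshold=0.5, sense_prob=None):
    """Score every WordNet sense of target_word against the context.

    When a cosine score does not exceed the threshold, it is combined
    with the sense's prior probability (e.g., its relative frequency in
    a sense-tagged corpus such as SEMCOR), passed in via sense_prob.
    The 50/50 weighting below is an assumption; the abstract does not
    specify the exact combination rule.
    """
    ctx_vec = sentence_vector(context_tokens)
    scores = {}
    for synset in wn.synsets(target_word):
        def_vec = sentence_vector(expanded_definition(synset))
        if ctx_vec is None or def_vec is None:
            continue
        score = cosine(ctx_vec, def_vec)
        if score <= threshold and sense_prob is not None:
            score = 0.5 * score + 0.5 * sense_prob.get(synset.name(), 0.0)
        scores[synset.name()] = score
    # The highest-scoring senses are the candidate answers.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

For example, score_senses("the bank approved my loan".split(), "bank") would rank the financial-institution sense of "bank" above the river-bank sense whenever the context vector lies closer to that sense's expanded definition vector.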