Word Sense Disambiguation Using Cosine Similarity Collaborates with Word2vec and WordNet

Abstract

Words have different meanings (i.e., senses) depending on the context. Disambiguating the correct sense is an important and challenging task for natural language processing. An intuitive approach is to select the sense whose definition has the highest similarity to the context, using the sense definitions provided by WordNet, a large lexical database of English. In this database, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms interlinked through conceptual-semantic and lexical relations. Traditional unsupervised approaches compute similarity by counting words that overlap between the context and the sense definitions, which must match exactly. Similarity should instead be computed from how words are related rather than from exact overlap, by representing the context and the sense definitions in a vector space model and analyzing the distributional semantic relationships among them using latent semantic analysis (LSA). However, as a corpus of text grows, LSA consumes much more memory and does not scale well to training on a huge corpus. Word-embedding approaches avoid this problem. Word2vec is a popular word-embedding approach that represents words in a fixed-size vector space using either the skip-gram or the continuous bag-of-words (CBOW) model, and it captures semantic and syntactic word similarities from a huge corpus more effectively than LSA. Our method uses Word2vec to construct a context sentence vector and sense definition vectors, then scores each word sense by the cosine similarity between those sentence vectors. Each sense definition is also expanded with sense relations retrieved from WordNet. If a score does not exceed a specific threshold, it is combined with the probability of that sense estimated from a large sense-tagged corpus, SEMCOR. The senses with the highest scores are taken as the possible answers. Our results (50.9%, or 48.7% without the sense distribution probability) are higher than the baselines (the original, simplified, adapted, and LSA variants of Lesk) and outperform many unsupervised systems that participated in the SENSEVAL-3 English lexical sample task.
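
To make the scoring scheme concrete, the following is a minimal Python sketch, assuming a pre-trained word2vec model loaded with gensim and NLTK's WordNet interface. The helper names (sentence_vector, expanded_definition, score_senses), the expansion via hypernyms and hyponyms, the threshold value, and the weighted combination with the sense prior are illustrative assumptions, not the paper's exact implementation.

import numpy as np
from nltk.corpus import wordnet as wn
from gensim.models import KeyedVectors

# Assumed: any pre-trained model in word2vec format (the path is a placeholder).
wv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def sentence_vector(tokens):
    """Average the word2vec vectors of the in-vocabulary tokens."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def expanded_definition(synset):
    """Sense definition expanded with definitions of related synsets
    (hypernyms and hyponyms here, one plausible choice of relations)."""
    text = synset.definition()
    for related in synset.hypernyms() + synset.hyponyms():
        text += " " + related.definition()
    return text.lower().split()

def score_senses(context_tokens, target_word, threshold=0.5, sense_prob=None):
    """Score every WordNet sense of target_word against the context.

    When a cosine score does not exceed the threshold, it is combined
    with the sense's prior probability (e.g., its relative frequency in
    a sense-tagged corpus such as SEMCOR), passed in via sense_prob.
    The 50/50 weighting below is an assumption; the abstract does not
    specify the exact combination rule.
    """
    ctx_vec = sentence_vector(context_tokens)
    scores = {}
    for synset in wn.synsets(target_word):
        def_vec = sentence_vector(expanded_definition(synset))
        if ctx_vec is None or def_vec is None:
            continue
        score = cosine(ctx_vec, def_vec)
        if score <= threshold and sense_prob is not None:
            score = 0.5 * score + 0.5 * sense_prob.get(synset.name(), 0.0)
        scores[synset.name()] = score
    # The highest-scoring senses are the candidate answers.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

For example, score_senses("the bank approved my loan".split(), "bank") would rank the financial-institution sense of "bank" above the river-bank sense whenever the context vector lies closer to that sense's expanded definition vector.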