
Corpus Specificity in LSA and Word2vec: The Role of Out-of-Domain Documents


Abstract

Latent Semantic Analysis (LSA) and Word2vec are among the most widely used word embeddings. Despite the popularity of these techniques, the precise mechanisms by which they acquire new semantic relations between words remain unclear. In the present article we investigate whether the capacity of LSA and Word2vec to identify relevant semantic dimensions increases with corpus size. One intuitive hypothesis is that the capacity to identify relevant dimensions should increase as the amount of data increases. However, if the corpus grows in topics that are not specific to the domain of interest, the signal-to-noise ratio may weaken. Here we set out to examine and distinguish these alternative hypotheses. To investigate the effect of corpus specificity and size on word embeddings, we study two ways of progressively eliminating documents: the elimination of random documents vs. the elimination of documents unrelated to a specific task. We show that Word2vec can take advantage of all the documents, obtaining its best performance when it is trained with the whole corpus. On the contrary, the specialization of the training corpus (removal of out-of-domain documents), accompanied by a decrease in dimensionality, can increase LSA word-representation quality while speeding up processing time. Furthermore, we show that specialization without the decrease in LSA dimensionality can produce a strong performance reduction on specific tasks. From a cognitive-modeling point of view, we point out that LSA's word-knowledge acquisition may not be efficiently exploiting higher-order co-occurrences and global relations, whereas Word2vec does.
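The LSA side of the experiment, building word vectors from a truncated SVD of a term-document matrix, then specializing the corpus by dropping out-of-domain documents while also reducing dimensionality, can be sketched minimally as follows. This is an illustrative toy, not the paper's actual pipeline: the tiny corpus, the raw count weighting (instead of, e.g., tf-idf or log-entropy), and the dimension choices are all assumptions made for brevity.

```python
import numpy as np

def lsa_word_vectors(docs, k):
    """LSA sketch: SVD of a term-document count matrix, keeping the
    top-k latent dimensions; word i is represented by U[i, :k] * S[:k]."""
    vocab = sorted({w for d in docs for w in d.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            M[idx[w], j] += 1.0
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return {w: U[i, :k] * S[:k] for w, i in idx.items()}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Hypothetical in-domain (astronomy) corpus plus out-of-domain (cooking) noise.
in_domain = ["star galaxy telescope", "galaxy telescope orbit",
             "star orbit telescope"]
out_domain = ["recipe flour oven", "oven flour sugar"]

# Full corpus at full rank vs. specialized corpus at reduced dimensionality,
# mirroring the two training regimes the abstract compares.
full = lsa_word_vectors(in_domain + out_domain, k=5)
spec = lsa_word_vectors(in_domain, k=2)
print("full-corpus  sim(star, galaxy):", cosine(full["star"], full["galaxy"]))
print("specialized  sim(star, galaxy):", cosine(spec["star"], spec["galaxy"]))
```

On a realistic corpus one would measure representation quality against a benchmark task after each elimination step; here the point is only the mechanics of specialization plus rank reduction.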
