International Conference on Language Resources and Evaluation

Word Embedding Evaluation in Downstream Tasks and Semantic Analogies

Abstract

Language models have long been a prolific area of study in Natural Language Processing (NLP). Among the newer and most widely used kinds of language models are Word Embeddings (WE): vector-space representations of a vocabulary, learned by an unsupervised neural network from the contexts in which words appear. WE have been widely adopted as input features for textual data in downstream tasks across many areas of NLP. This paper presents the evaluation of newly released WE models for the Portuguese language, trained on a corpus of 4.9 billion tokens. The first evaluation is an intrinsic task in which the WEs must correctly complete semantic and syntactic analogies. The second is an extrinsic evaluation in which the WE models are used in two downstream tasks: Named Entity Recognition and Semantic Similarity between Sentences. Our results show that a diverse and comprehensive corpus can often outperform a larger but less textually diverse one, and that feeding the text to the WE training algorithm in parts may degrade embedding quality.
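To make the evaluations described above concrete, here is a minimal sketch in Python using gensim of how an intrinsic analogy test and a simple averaged-vector sentence-similarity baseline are typically run. The model file (embeddings_pt.vec), the analogy test set (analogies_pt.txt), and the example words and sentences are hypothetical placeholders; the abstract does not specify the paper's actual models, test sets, or downstream architectures.

```python
from gensim.models import KeyedVectors
import numpy as np

# Load pre-trained embeddings in word2vec text format (hypothetical path).
wv = KeyedVectors.load_word2vec_format("embeddings_pt.vec")

# --- Intrinsic evaluation: semantic analogies ---
# "rei" (king) - "homem" (man) + "mulher" (woman) should land near
# "rainha" (queen) if the semantic relation was captured.
print(wv.most_similar(positive=["rei", "mulher"], negative=["homem"], topn=1))

# gensim can also score a whole analogy test set (one four-word analogy
# per line), which is how intrinsic evaluations like this are usually run.
accuracy, _sections = wv.evaluate_word_analogies("analogies_pt.txt")
print(f"Analogy accuracy: {accuracy:.3f}")

# --- Sentence similarity baseline ---
# A common simple baseline for semantic similarity between sentences is
# the cosine similarity of averaged word vectors (an illustration only;
# not necessarily the method the paper used).
def sentence_vector(sentence):
    vecs = [wv[w] for w in sentence.lower().split() if w in wv]
    return np.mean(vecs, axis=0)

a = sentence_vector("o menino joga bola")
b = sentence_vector("a criança brinca com a bola")
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Sentence similarity: {cosine:.3f}")
```

A pipeline along these lines would report analogy accuracy for the intrinsic task and correlate the baseline similarity scores with human judgments for the extrinsic one; the downstream NER task would instead feed the vectors into a sequence-labeling model.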