Conference: Analysis of Images, Social Networks and Texts

Size vs. Structure in Training Corpora for Word Embedding Models: Araneum Russicum Maximum and Russian National Corpus

Abstract

In this paper, we present a distributional word embedding model trained on one of the largest available Russian corpora: Araneum Russicum Maximum (over 10 billion words crawled from the web). We compare this model to a model trained on the Russian National Corpus (RNC). The two corpora differ substantially in size and in their compilation procedures. We test these differences by evaluating the trained models against the Russian part of the Multilingual SimLex999 semantic similarity dataset. We detect and describe numerous issues in this dataset and publish a new, corrected version. Beyond the already known fact that the RNC is generally a better training corpus than web corpora, we enumerate and explain fine-grained differences in how the models handle the semantic similarity task, which parts of the evaluation set are difficult for particular models, and why. Additionally, we describe the learning curves for both models, showing that the RNC is generally more robust as training material for this task.
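
Concretely, an evaluation of this kind ranks word pairs by model-assigned cosine similarity and correlates that ranking with human judgments via Spearman's rho. Below is a minimal sketch of such an evaluation using gensim; the model and dataset file names are hypothetical placeholders, and this is not the authors' actual experimental code.

from gensim.models import KeyedVectors

# Load pre-trained word vectors (word2vec text format assumed here).
model = KeyedVectors.load_word2vec_format("araneum_maximum.vec", binary=False)

# The gold-standard file is assumed to be tab-separated with no header:
#   word1 <TAB> word2 <TAB> human similarity score
pearson, spearman, oov = model.evaluate_word_pairs("ru_simlex999.tsv")

print("Spearman rho: %.3f (p = %.3g)" % (spearman[0], spearman[1]))
print("Pairs with out-of-vocabulary words: %.1f%%" % oov)

The out-of-vocabulary ratio is worth reporting alongside the correlation: a model trained on a smaller corpus may score well simply because hard, rare-word pairs were skipped.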
