
Using Similarity Measures to Select Pretraining Data for NER


Abstract

Word vectors and language models (LMs) pretrained on large amounts of unlabelled data can dramatically improve various Natural Language Processing (NLP) tasks. However, how to measure the similarity between pretraining data and target task data, and what impact that similarity has, have largely been left to intuition. We propose three cost-effective measures that quantify different aspects of similarity between source pretraining data and target task data. Across 30 data pairs, we demonstrate that these measures are good predictors of how useful pretrained models will be for Named Entity Recognition (NER). The results also suggest that pretrained LMs are more effective and more predictable than pretrained word vectors, but that pretrained word vectors are a better choice when the pretraining data is dissimilar to the target data.
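The abstract does not spell out the three measures themselves, but the underlying idea, scoring how well a candidate pretraining corpus matches the target task data before committing to expensive pretraining, can be illustrated with a simple vocabulary-coverage statistic. The sketch below is an illustration of that general idea, not the paper's exact formulation; the function names, tokenisation choices, and file paths are all hypothetical.

```python
# Minimal sketch: token-weighted coverage of the target corpus's vocabulary
# by a candidate pretraining corpus, as one cheap similarity signal.
from collections import Counter

def vocab(path):
    """Collect a whitespace-token vocabulary with counts from a text file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.lower().split())
    return counts

def target_vocab_coverage(pretrain_path, target_path):
    """Fraction of target-corpus token occurrences whose word type also
    appears in the pretraining corpus (a value in [0, 1])."""
    pretrain_types = set(vocab(pretrain_path))
    target_counts = vocab(target_path)
    covered = sum(c for w, c in target_counts.items() if w in pretrain_types)
    total = sum(target_counts.values())
    return covered / total if total else 0.0

# Example: rank candidate pretraining corpora by how well they cover the
# NER training data (all file names here are placeholders).
if __name__ == "__main__":
    for source in ["wiki.txt", "pubmed.txt", "news.txt"]:
        score = target_vocab_coverage(source, "ner_train.txt")
        print(f"{source}\t{score:.3f}")
```

A higher score suggests fewer out-of-vocabulary words at transfer time; under the abstract's finding, a low score would be a hint that pretrained word vectors may transfer better than a pretrained LM for that source-target pair.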