
Using Similarity Measures to Select Pretraining Data for NER


Abstract

Word vectors and language models (LMs) pretrained on large amounts of unlabelled data can dramatically improve various Natural Language Processing (NLP) tasks. However, how to measure the similarity between pretraining data and target task data, and what impact that similarity has, have largely been left to intuition. We propose three cost-effective measures that quantify different aspects of similarity between source pretraining data and target task data. Across 30 data pairs, we demonstrate that these measures are good predictors of how useful pretrained models will be for Named Entity Recognition (NER). The results also suggest that pretrained LMs are more effective and more predictable than pretrained word vectors, but that pretrained word vectors are a better choice when the pretraining data is dissimilar to the target data.
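The abstract does not spell out the three measures themselves, but the underlying idea, scoring how well a candidate pretraining corpus matches the target task data before committing to expensive pretraining, can be illustrated with a simple vocabulary-coverage statistic. The sketch below is an illustration of that general idea, not the paper's exact formulation; the function names, tokenisation choices, and file paths are all hypothetical.

```python
# Minimal sketch: token-weighted coverage of the target corpus's vocabulary
# by a candidate pretraining corpus, as one cheap similarity signal.
from collections import Counter

def vocab(path):
    """Collect a whitespace-token vocabulary with counts from a text file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.lower().split())
    return counts

def target_vocab_coverage(pretrain_path, target_path):
    """Fraction of target-corpus token occurrences whose word type also
    appears in the pretraining corpus (a value in [0, 1])."""
    pretrain_types = set(vocab(pretrain_path))
    target_counts = vocab(target_path)
    covered = sum(c for w, c in target_counts.items() if w in pretrain_types)
    total = sum(target_counts.values())
    return covered / total if total else 0.0

# Example: rank candidate pretraining corpora by how well they cover the
# NER training data (all file names here are placeholders).
if __name__ == "__main__":
    for source in ["wiki.txt", "pubmed.txt", "news.txt"]:
        score = target_vocab_coverage(source, "ner_train.txt")
        print(f"{source}\t{score:.3f}")
```

A higher score suggests fewer out-of-vocabulary words at transfer time; under the abstract's finding, a low score would be a hint that pretrained word vectors may transfer better than a pretrained LM for that source-target pair.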