Annual Meeting of the Association for Computational Linguistics; Workshop on Biomedical Natural Language Processing

Sub-word information in pre-trained biomedical word representations: evaluation and hyper-parameter optimization



Abstract

Word2vec embeddings are limited to computing vectors for in-vocabulary terms and do not take sub-word information into account. Character-based representations, such as fastText, mitigate these limitations. We optimize and compare these representations for the biomedical domain. fastText was found to consistently outperform word2vec in named entity recognition tasks for entities such as chemicals and genes. This is likely due to information gained from computed out-of-vocabulary term vectors, as well as the word compositionality of such entities. In contrast, performance varied on intrinsic datasets. Optimal hyper-parameters were intrinsic dataset-dependent, likely due to differences in term-type distributions. This indicates embeddings should be chosen based on the task at hand. We therefore provide a number of optimized hyper-parameter sets and pre-trained word2vec and fastText models, available at https://github.com/dterg/bionlp-embed.
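To make the out-of-vocabulary distinction concrete, the minimal sketch below (not from the paper) contrasts gensim's word2vec and fastText lookups: word2vec can only return vectors for terms seen during training, while fastText composes a vector for an unseen term from its character n-grams. The model filenames and the example term are hypothetical placeholders standing in for the pre-trained models released at https://github.com/dterg/bionlp-embed.

```python
# Sketch: OOV behavior of word2vec vs. fastText in gensim.
# Filenames below are hypothetical placeholders for the released models.
from gensim.models import Word2Vec, FastText

w2v = Word2Vec.load("word2vec_pubmed.model")   # hypothetical path
ft = FastText.load("fasttext_pubmed.model")    # hypothetical path

term = "dexmedetomidine"  # example rare chemical name

# word2vec: only in-vocabulary terms have vectors
if term in w2v.wv:
    print(w2v.wv[term][:5])
else:
    print("OOV for word2vec: no vector available")

# fastText: composes a vector from the term's character n-grams,
# so even an unseen term receives a representation
print(ft.wv[term][:5])  # first 5 dimensions of the composed vector
```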
