Annual Meeting of the Association for Computational Linguistics; Workshop on Biomedical Natural Language Processing

Sub-word information in pre-trained biomedical word representations: evaluation and hyper-parameter optimization



Abstract

Word2vec embeddings are limited to computing vectors for in-vocabulary terms and do not take sub-word information into account. Character-based representations, such as fastText, mitigate these limitations. We optimize and compare these representations for the biomedical domain. fastText was found to consistently outperform word2vec in named entity recognition tasks for entities such as chemicals and genes. This is likely due to information gained from computed out-of-vocabulary term vectors, as well as the word compositionality of such entities. In contrast, performance varied on intrinsic datasets. Optimal hyper-parameters were intrinsic dataset-dependent, likely due to differences in term-type distributions. This indicates embeddings should be chosen based on the task at hand. We therefore provide a number of optimized hyper-parameter sets and pre-trained word2vec and fastText models, available at https://github.com/dterg/bionlp-embed.
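To make the out-of-vocabulary distinction concrete, the minimal sketch below (not from the paper) contrasts gensim's word2vec and fastText lookups: word2vec can only return vectors for terms seen during training, while fastText composes a vector for an unseen term from its character n-grams. The model filenames and the example term are hypothetical placeholders standing in for the pre-trained models released at https://github.com/dterg/bionlp-embed.

```python
# Sketch: OOV behavior of word2vec vs. fastText in gensim.
# Filenames below are hypothetical placeholders for the released models.
from gensim.models import Word2Vec, FastText

w2v = Word2Vec.load("word2vec_pubmed.model")   # hypothetical path
ft = FastText.load("fasttext_pubmed.model")    # hypothetical path

term = "dexmedetomidine"  # example rare chemical name

# word2vec: only in-vocabulary terms have vectors
if term in w2v.wv:
    print(w2v.wv[term][:5])
else:
    print("OOV for word2vec: no vector available")

# fastText: composes a vector from the term's character n-grams,
# so even an unseen term receives a representation
print(ft.wv[term][:5])  # first 5 dimensions of the composed vector
```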
