Journal: BMC Medical Informatics and Decision Making

Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec


Abstract

Background: Understanding semantic relatedness and similarity between biomedical terms has a great impact on a variety of applications such as biomedical information retrieval, information extraction, and recommender systems. The objective of this study is to examine word2vec's ability to derive semantic relatedness and similarity between biomedical terms from large publication data. Specifically, we focus on the effects of recency, size, and section of biomedical publication data on the performance of word2vec.

Methods: We download abstracts of 18,777,129 articles from PubMed and 766,326 full-text articles from PubMed Central (PMC). The datasets are preprocessed and grouped into subsets by recency, size, and section. Word2vec models are trained on these subsets. Cosine similarities between biomedical terms obtained from the word2vec models are compared against reference standards. The performance of models trained on different subsets is compared to examine recency, size, and section effects.

Results: Models trained on recent datasets did not perform better. Models trained on larger datasets identified more pairs of biomedical terms than models trained on smaller datasets in the relatedness task (from 368 at the 10% level to 494 at the 100% level) and in the similarity task (from 374 at the 10% level to 491 at the 100% level). The model trained on abstracts produced results that have higher correlations with the reference standards than the one trained on article bodies (0.65 vs. 0.62 in the similarity task and 0.66 vs. 0.59 in the relatedness task). However, the model trained on article bodies identified more pairs of biomedical terms than the one trained on abstracts (498 vs. 344 in the similarity task and 503 vs. 339 in the relatedness task).

Conclusions: Increasing the size of the dataset does not always enhance performance. It can result in the identification of more relations between biomedical terms, but it does not guarantee better precision. As summaries of research articles, abstracts excel in accuracy compared with article bodies but lose in coverage of identifiable relations.
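The evaluation described in the Methods can be sketched in a few lines of Python: score each reference pair by the cosine similarity of its two term vectors, skip pairs with a missing term, and correlate the model scores with the reference ratings. This is only an illustrative sketch, not the authors' code. The term vectors and ratings below are made-up toy values, the function names are hypothetical, and the use of Spearman rank correlation is an assumption, since the abstract does not say which correlation coefficient was used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (assumed metric)."""
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, reference):
    """Return (number of covered pairs, correlation with the reference ratings).

    A pair is covered only if both terms have a vector in the model.
    """
    model_scores, human_scores = [], []
    for (term_a, term_b), rating in reference.items():
        if term_a in vectors and term_b in vectors:
            model_scores.append(cosine(vectors[term_a], vectors[term_b]))
            human_scores.append(rating)
    return len(model_scores), spearman(model_scores, human_scores)


# Toy, hypothetical term vectors standing in for a trained word2vec model.
vectors = {
    "heart": [0.9, 0.1, 0.0],
    "myocardium": [0.8, 0.2, 0.1],
    "kidney": [0.1, 0.9, 0.1],
    "renal": [0.2, 0.8, 0.0],
    "aspirin": [0.0, 0.2, 0.9],
}

# Toy, hypothetical reference ratings; "appendicitis" has no vector on purpose.
reference = {
    ("heart", "myocardium"): 3.5,
    ("kidney", "renal"): 3.8,
    ("heart", "kidney"): 1.5,
    ("aspirin", "heart"): 1.2,
    ("aspirin", "appendicitis"): 1.0,
}

covered, rho = evaluate(vectors, reference)
print(f"covered pairs: {covered} of {len(reference)}, correlation: {rho:.2f}")
```

The two returned numbers correspond to the two quantities the Results report: the count of term pairs the model can score (coverage) and the correlation with the reference standard (accuracy). A model with a larger vocabulary can cover more pairs without achieving a higher correlation.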
