
Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records



Abstract

Background: Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts to develop related datasets and models in the general domain, both datasets and models remain limited in the biomedical and clinical domains. The BioCreative/OHNLP2018 organizers made the first attempt to annotate 1068 sentence pairs from clinical notes and called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge.

Methods: We developed models using traditional machine learning and deep learning approaches. In the post-challenge phase, we focused on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly.

Results: The official results show that our best submission was an ensemble of eight models. It achieved a Pearson correlation coefficient of 0.8328, the highest performance among 13 submissions from 4 teams. In the post-challenge phase, the performance of both the Random Forest and the Encoder Network improved; in particular, the correlation of the Encoder Network improved by ~13%. During the challenge task, no end-to-end deep learning model outperformed machine learning models that take manually crafted features. In contrast, with sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves a correlation of ~0.84, higher than the original best model. The ensemble model taking the improved versions of the Random Forest and the Encoder Network as inputs further increased performance to 0.8528.

Conclusions: Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Through error analysis, we find that end-to-end deep learning models and traditional machine learning models with manually crafted features complement each other by finding different types of sentences. We suggest that a combination of these models can better find similar sentences in practice.
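To make the pipeline concrete, below is a minimal Python sketch (not the authors' implementation) of the feature-based approach the abstract describes: pre-trained biomedical sentence embeddings are combined into pair features, a Random Forest regressor predicts a similarity score, and predictions are evaluated with the Pearson correlation used by the challenge. The embed_sentence stub, the absolute-difference/product feature recipe, the 700-dimensional embedding size, and the toy sentence pairs are all illustrative assumptions; in practice embed_sentence would be a sentence encoder pre-trained on PubMed abstracts and MIMIC-III clinical notes.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import pearsonr

DIM = 700  # assumed embedding dimensionality

def embed_sentence(sentence):
    # Hypothetical stand-in for a pre-trained biomedical sentence
    # encoder; deterministic within a run so the sketch is self-contained.
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.standard_normal(DIM)

def pair_features(s1, s2):
    # Combine two sentence embeddings into one feature vector using the
    # element-wise absolute difference and product, a common STS recipe.
    e1, e2 = embed_sentence(s1), embed_sentence(s2)
    return np.concatenate([np.abs(e1 - e2), e1 * e2])

# Toy sentence pairs with gold similarity scores on the 0-5 STS scale.
pairs = [
    ("Patient denies chest pain.", "The patient reports no chest pain.", 4.5),
    ("Patient denies chest pain.", "Ibuprofen 400 mg taken as needed.", 0.5),
    ("Follow up in two weeks.", "Return to clinic in 14 days.", 4.0),
    ("Blood pressure 120/80.", "Patient ambulates without assistance.", 0.0),
]
X = np.stack([pair_features(a, b) for a, b, _ in pairs])
y = np.array([score for _, _, score in pairs])

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# The challenge metric: Pearson correlation between predicted and gold
# scores (computed here on the training pairs only, as the data are toy).
pred = rf.predict(X)
r, _ = pearsonr(pred, y)
print(f"Pearson r = {r:.4f}")

In the full setup the abstract describes, such a regressor would be ensembled with the Encoder Network, for example by averaging the two models' predicted scores before computing the final correlation.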

