PLoS Computational Biology

Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine


Abstract

The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting disease-gene-variant triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of these triplets. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer’s disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors.
Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships.
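
The two evaluation steps described in the abstract (scoring against a benchmark with the F1-measure, and splitting extracted triplets by whether they overlap a curated database) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Triplet`, `split_by_overlap`, and `f1_score` are hypothetical, the sample triplets are toy data chosen only for illustration, and matching is by exact string equality, whereas a real curation pipeline would need to normalize gene and variant names first.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Triplet(NamedTuple):
    """A disease-gene-variant triplet (hypothetical representation)."""
    disease: str
    gene: str
    variant: str


def split_by_overlap(
    extracted: Iterable[Triplet], curated: Iterable[Triplet]
) -> Tuple[Set[Triplet], Set[Triplet]]:
    """Partition extracted triplets into those found in the curated set
    and those absent from it. Assumes exact-match comparison."""
    curated_set = set(curated)
    extracted_set = set(extracted)
    overlapping = extracted_set & curated_set
    non_overlapping = extracted_set - curated_set
    return overlapping, non_overlapping


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only; not taken from the paper's results.
curated = [
    Triplet("cystic fibrosis", "CFTR", "p.F508del"),
    Triplet("hemochromatosis", "HFE", "p.C282Y"),
]
extracted = [
    Triplet("cystic fibrosis", "CFTR", "p.F508del"),
    Triplet("breast cancer", "GENE_X", "p.A1B"),  # made-up placeholder
]

overlapping, non_overlapping = split_by_overlap(extracted, curated)
print(len(overlapping), len(non_overlapping))
print(round(f1_score(tp=8, fp=2, fn=2), 2))
```

In this sketch the non-overlapping set is not treated as wrong by default. As in the abstract, triplets absent from the curated database still need to be checked (for example on a stratified sample) before an accuracy can be stated for them.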
