Conference: Annual International Conference of the IEEE Engineering in Medicine and Biology Society

Predicting the pathogenicity of protein coding mutations using Natural Language Processing

Abstract

DNA sequencing of tumor cells has revealed thousands of genetic mutations, but only some of them cause cancer. Distinguishing mutations that contribute to tumor growth from neutral ones is extremely challenging and is currently done manually, which is cumbersome and costly in both time and money. In this study, we introduce a novel method, "NLP-SNPPred", that reads scientific literature and learns the implicit features that make certain variations pathogenic. Specifically, our method ingests the biomedical literature and produces vector representations of it using state-of-the-art NLP methods such as sent2vec, word2vec and tf-idf. These representations are then fed to machine learning predictors to distinguish pathogenic from neutral variations. Our best model (NLP-SNPPred), trained on OncoKB and evaluated on several publicly available benchmark datasets, outperformed state-of-the-art function prediction methods. Our results show that NLP can be used effectively to predict the functional impact of protein-coding variations with minimal complementary biological features. Moreover, encoding biological knowledge into the right representations, combined with machine learning methods, can help automate manual efforts. A free-to-use web server is available at http://www.nlp-snppred.cbrlab.org
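The general idea in the abstract, turning literature text into vectors and passing them to a classifier, can be illustrated with a minimal sketch. This is not the NLP-SNPPred implementation: it uses only tf-idf (no sent2vec or word2vec), substitutes a simple nearest-centroid classifier for the unspecified machine learning predictors, and runs on a handful of invented sentences rather than OncoKB. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines do more)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    """L2-normalised tf-idf vector as a sparse dict; unseen terms are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, value in vec.items():
            total[term] = total.get(term, 0.0) + value
    return {term: value / len(vectors) for term, value in total.items()}


def dot(a, b):
    """Dot product of two sparse vectors."""
    return sum(value * b.get(term, 0.0) for term, value in a.items())


def train(texts, labels):
    """Fit idf weights and one centroid per class label."""
    idf = fit_idf(texts)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(tfidf(text, idf))
    return idf, {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(text, model):
    """Return the label whose centroid is most similar to the text's vector."""
    idf, centroids = model
    vec = tfidf(text, idf)
    return max(centroids, key=lambda label: dot(vec, centroids[label]))


# Invented toy sentences standing in for literature about each variant.
TRAIN_TEXTS = [
    "variant activates kinase signaling and drives tumor growth",
    "mutation confers oncogenic gain of function in tumor cells",
    "variant is a common polymorphism with no functional effect",
    "mutation shows no change in protein activity and is benign",
]
TRAIN_LABELS = ["pathogenic", "pathogenic", "neutral", "neutral"]

model = train(TRAIN_TEXTS, TRAIN_LABELS)
print(predict("this mutation drives oncogenic tumor growth", model))
print(predict("benign polymorphism with no functional effect", model))
```

With real data, the toy sentences would be replaced by text retrieved for each variant, and the nearest-centroid step by whichever trained predictor performs best on held-out benchmarks.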