Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics

Abstract

We introduce a new representation and feature extraction method for biological sequences. We call the general representation bio-vectors (BioVec), with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences; it can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors, which can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate the method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot and belonging to 7,027 protein families, obtaining an average family classification accuracy of 93%±0.06% and outperforming existing family classification methods. In addition, we use the ProtVec representation to distinguish disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database and a database featuring the disordered regions of nucleoporins rich in phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that, given only sequence data for various proteins, the model can recover accurate information about protein structure. Importantly, the model needs to be trained only once and can then be applied to extract a comprehensive set of information about proteins of interest. Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data are available at the Life Language Processing website and Harvard Dataverse.
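The abstract does not spell out how a sequence becomes a single dense vector. One plausible reading, sketched below in Python, is to split the sequence into short overlapping k-mers and sum a learned embedding for each k-mer. Everything here is an assumption made for illustration: the choice k = 3, the function names `shifted_kmers` and `sequence_vector`, and the toy two-dimensional embedding table, which stands in for vectors that would come from a trained neural network model.

```python
from typing import Dict, List


def shifted_kmers(seq: str, k: int = 3) -> List[List[str]]:
    """Split seq into k lists of non-overlapping k-mers, one list per start offset."""
    return [
        [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        for offset in range(k)
    ]


def sequence_vector(
    seq: str,
    embedding: Dict[str, List[float]],
    dim: int,
    k: int = 3,
) -> List[float]:
    """Sum the embedding vectors of every k-mer in seq (assumed pooling rule).

    K-mers missing from the embedding table are skipped.
    """
    total = [0.0] * dim
    for kmers in shifted_kmers(seq, k):
        for kmer in kmers:
            vec = embedding.get(kmer)
            if vec is None:
                continue
            for j in range(dim):
                total[j] += vec[j]
    return total


# Hypothetical toy embedding; real vectors would be learned from a large corpus.
toy_embedding = {"MKT": [1.0, 0.0], "IAK": [0.5, 2.0]}
protein_vec = sequence_vector("MKTAYIAK", toy_embedding, dim=2)
```

The resulting fixed-length vector could then be passed to any standard classifier, such as the support vector machines mentioned in the abstract.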