Journal: PLoS Genetics

Distinguishing Protein-Coding from Non-Coding RNAs through Support Vector Machines


Abstract

RIKEN's FANTOM project has revealed many previously unknown coding sequences, as well as an unexpected degree of variation in transcripts resulting from alternative promoter usage and splicing. More generally, transcriptome studies have identified ever more transcripts that do not code for proteins, and increasing evidence points to important cellular roles for such non-coding RNAs (ncRNAs). Distinguishing protein-coding RNA transcripts from ncRNA transcripts is therefore an important problem in understanding and annotating the transcriptome. Very few in silico methods have specifically addressed this problem. Here, we introduce CONC (for “coding or non-coding”), a novel method based on support vector machines that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, predicted secondary structure content, predicted percentage of exposed residues, compositional entropy, number of homologs from database searches, and alignment entropy. Nucleotide frequencies are also incorporated into the method. Confirmed coding cDNAs for eukaryotic proteins from the Swiss-Prot database constituted the set of true positives, and ncRNAs from RNAdb and NONCODE constituted the true negatives. Ten-fold cross-validation suggested that CONC distinguished coding RNAs from ncRNAs at about 97% specificity and 98% sensitivity. Applied to 102,801 mouse cDNAs from the FANTOM3 dataset, our method reliably identified over 14,000 ncRNAs and estimated the total number of ncRNAs to be about 28,000.
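
To make the sequence-derived features in the abstract more concrete, the following is a minimal Python sketch of how a few of them (peptide length, amino acid composition, compositional entropy, nucleotide frequencies) and the two reported metrics (sensitivity, specificity) might be computed. It is not the CONC implementation. All function names are hypothetical, and the choice of candidate peptide (the longest stretch from a start codon to a stop in the three forward frames) is an assumption, since the abstract does not say how the peptide is derived. Features that need external tools (predicted secondary structure, exposed residues, homolog counts, alignment entropy) and the SVM training step are omitted.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def longest_peptide(seq):
    """Longest M-to-stop stretch over the three forward frames (assumed rule)."""
    seq = seq.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
        for segment in protein.split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best):
                best = segment[start:]
    return best


def frequencies(symbols):
    """Relative frequency of each symbol in a string."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()} if total else {}


def shannon_entropy(symbols):
    """Shannon entropy (bits) of the symbol composition."""
    return -sum(p * math.log2(p) for p in frequencies(symbols).values())


def transcript_features(seq):
    """Hypothetical feature dictionary for one transcript."""
    peptide = longest_peptide(seq)
    return {
        "peptide_length": len(peptide),
        "aa_composition": frequencies(peptide),
        "compositional_entropy": shannon_entropy(peptide),
        "nucleotide_frequencies": frequencies(seq.upper().replace("U", "T")),
    }


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Labels: 1 = coding (positive), 0 = non-coding (negative).
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec
```

In a fuller pipeline, feature dictionaries like these would be flattened into numeric vectors and passed to an SVM library, with `sensitivity_specificity` evaluated on each held-out fold of a ten-fold cross-validation.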
