Journal: BMC Genomics

A Support Vector Machine based method to distinguish long non-coding RNAs from protein coding transcripts



Abstract

In recent years, thousands of sequencing projects around the world have generated a rapidly increasing number of RNA transcripts, creating enormous volumes of transcript data to be analyzed. An important problem in analyzing these data is distinguishing long non-coding RNAs (lncRNAs) from protein coding transcripts (PCTs). We therefore present a Support Vector Machine (SVM) based method to distinguish lncRNAs from PCTs, using features based on the frequencies of nucleotide patterns and on ORF lengths in transcripts.

The proposed method uses as features the relative length of the first ORF and the frequencies of nucleotide patterns selected by PCA. FASTA files were used as input to calculate all possible features. These features were divided into two sets: (i) 336 frequencies of nucleotide patterns; and (ii) 4 features derived from ORFs. PCA was applied to the first set to identify 6 groups of frequencies that could contribute most to the distinction. Twenty-four experiments, combining the 6 groups from the first set with the features from the second set, were built to find the best model to distinguish lncRNAs from PCTs.

The method was trained and tested with human (Homo sapiens), mouse (Mus musculus) and zebrafish (Danio rerio) data, achieving 98.21%, 98.03% and 96.09% accuracy, respectively. Compared with other tools available in the literature (CPAT, CPC, iSeeRNA, lncRNApred, lncRScan-SVM and FEELnc), it showed an improvement in accuracy of ≈3.00%. In addition, to validate our model, the mouse data was classified with the human model, and vice versa, achieving ≈97.80% accuracy in both cases, indicating that the model is not overfit. The SVM models were also validated with data from rat (Rattus norvegicus), pig (Sus scrofa) and fruit fly (Drosophila melanogaster), obtaining more than 84.00% accuracy in all of these organisms. Our results also showed that 81.2% of human pseudogenes and 91.7% of mouse pseudogenes were classified as non-coding. Moreover, our method was able to re-annotate two uncharacterized sequences from the Swiss-Prot database as having a high probability of being lncRNAs. Finally, to assess the method for annotating transcripts derived from RNA-seq, previously identified lncRNAs of human, gorilla (Gorilla gorilla) and rhesus macaque (Macaca mulatta) were analyzed, of which 98.62%, 80.8% and 91.9%, respectively, were successfully classified.

The SVM method proposed in this work shows high performance in distinguishing lncRNAs from PCTs, as shown in the results. To build the model, besides using ORF features known from the literature, we used PCA to identify the nucleotide pattern frequencies that contribute most to distinguishing lncRNAs from PCTs in reference data sets. Interestingly, models created with two evolutionarily distant species could distinguish lncRNAs of even more distant species.
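
The two feature sets described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives the count of 336 nucleotide pattern frequencies but does not list the patterns; the sketch assumes they are all di-, tri- and tetranucleotides, which happens to match the count (4^2 + 4^3 + 4^4 = 16 + 64 + 256 = 336). The definition of the "first ORF" used here (the first ATG followed by an in-frame stop codon, with its length divided by the transcript length) is likewise an assumption, and the function names `kmer_frequencies` and `first_orf_relative_length` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_frequencies(seq, ks=(2, 3, 4)):
    """Relative frequency of every k-mer over ACGT, for each k in ks.

    Assumption: the 336 pattern frequencies are the 2-, 3- and 4-mers.
    Frequencies are normalized separately for each k.
    """
    seq = seq.upper().replace("U", "T")
    freqs = {}
    for k in ks:
        counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
        windows = 0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip windows with N or other symbols
                counts[kmer] += 1
                windows += 1
        for kmer, count in counts.items():
            freqs[kmer] = count / windows if windows else 0.0
    return freqs


def first_orf_relative_length(seq):
    """Length of the first ATG...stop ORF divided by transcript length.

    Assumption: the first ORF is the first ATG (scanning 5' to 3') that
    is followed by an in-frame stop codon; returns 0.0 if none exists.
    """
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    start = seq.find("ATG")
    while start != -1:
        for i in range(start + 3, n - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                return (i + 3 - start) / n  # ORF length includes the stop
        start = seq.find("ATG", start + 1)
    return 0.0


if __name__ == "__main__":
    transcript = "CCATGAAATAGCC"  # toy sequence, not real data
    features = kmer_frequencies(transcript)
    print(len(features))  # number of pattern-frequency features
    print(first_orf_relative_length(transcript))
```

In a full pipeline, vectors like these would be computed for each transcript in a FASTA file, reduced with PCA, and passed to an SVM classifier; those steps are omitted here because the abstract does not specify their parameters.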
