Source: EURASIP Journal on Advances in Signal Processing

Categorization of species based on their microRNAs employing sequence motifs, information-theoretic sequence feature extraction, and k-mers


Abstract

Diseases like cancer can manifest themselves through changes in protein abundance, and microRNAs (miRNAs) play a key role in modulating protein quantity. MicroRNAs are used in all kingdoms of life and have been shown to be exploited by viruses to modulate their host environment. Since the experimental detection of miRNAs is difficult, computational methods have been developed. Many such tools employ machine learning for pre-miRNA detection, and many features for miRNA parameterization have been proposed. Training machine learning models requires negative data, which are important yet hard to come by; therefore, we recently started to employ pre-miRNAs from one species as positive data and another species’ pre-miRNAs as negative examples, based on sequence motifs and k-mers. Here, we introduce the additional use of information-theoretic (IT) features. Pre-miRNAs from one species were used as positive and another species’ pre-miRNAs as negative training data for machine learning. The categorization capability of IT and k-mer features was investigated. Both feature sets and their combinations yielded very high accuracy, as good as that of the previously suggested method based on sequence motifs and k-mers. However, high performance requires a sufficiently large phylogenetic distance between the species and a sufficiently high number of pre-miRNAs in the training set. To examine the contribution of the IT and k-mer features, an information gain-based feature ranking was performed. Although the top three features are IT features, 80% of the top 100 features are k-mers. The comparison of all three individual approaches (motifs, IT, and k-mers) shows that k-mers alone are sufficient to distinguish species based on their pre-miRNAs. IT sequence feature extraction enables the distinction among species and is less computationally expensive than motif calculations.
However, since IT features need larger amounts of data to provide enough statistics for highly accurate results, future categorization into species can be done effectively using k-mers only. The biological reasoning for this is the existence of a codon bias between species which can, at least, be observed in exonic miRNAs. Future work in this direction will be the ab initio detection of pre-miRNAs. In addition, pre-miRNAs could be predicted from RNA-seq data.
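
As an illustration of the k-mer features and the information gain-based ranking mentioned in the abstract, the following is a minimal sketch in Python. It is not the authors' implementation: the function names (`kmer_frequencies`, `information_gain`, `rank_kmers`), the sliding-window normalization, and the choice to binarize each feature at its mean before computing the gain are all assumptions made for illustration, and the toy sequences are invented.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGU"  # RNA alphabet; DNA input is converted (T -> U)


def kmer_frequencies(seq, k):
    """Relative frequency of each of the 4**k k-mers in seq (sliding window)."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Windows containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Gain from splitting the samples at the mean feature value.

    Assumption: a single mean threshold is used for simplicity; other
    discretization schemes would give different rankings.
    """
    threshold = sum(values) / len(values)
    low = [y for v, y in zip(values, labels) if v <= threshold]
    high = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return (entropy(labels)
            - (len(low) / n) * entropy(low)
            - (len(high) / n) * entropy(high))


def rank_kmers(sequences, labels, k):
    """Rank k-mers by information gain, highest first."""
    features = [kmer_frequencies(s, k) for s in sequences]
    gains = {m: information_gain([f[m] for f in features], labels)
             for m in features[0]}
    return sorted(gains.items(), key=lambda item: -item[1])


if __name__ == "__main__":
    # Invented toy data: two "species" with different base composition.
    seqs = ["AAAAAA", "AAAAAU", "GGGGGG", "GGGGGC"]
    species = ["pos", "pos", "neg", "neg"]
    for kmer, gain in rank_kmers(seqs, species, k=1):
        print(kmer, round(gain, 3))
```

In this toy setting the k-mers whose frequencies differ most between the two label groups receive the highest gain. A real experiment would use pre-miRNA sequences from two species and a larger k.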
