Journal: BMC Bioinformatics

Inferring bona fide transfrags in RNA-Seq derived-transcriptome assemblies of non-model organisms

Abstract

De novo transcriptome assembly of short transcribed fragments (transfrags) produced by sequencing-by-synthesis technologies often results in redundant datasets with differing levels of unassembled, partially assembled or mis-assembled transcripts. Post-assembly processing intended to reduce redundancy typically involves reassembly or clustering of the assembled sequences. However, these approaches are mostly based on common-word heuristics and often create clusters of biologically unrelated sequences, resulting in loss of unique transfrag annotations and propagation of mis-assemblies. Here, we propose a structured framework, organized as a pipeline of a few steps, for Inferring Functionally Relevant Assembly-derived Transcripts (IFRAT). IFRAT combines 1) removal of identical subsequences, 2) error-tolerant CDS prediction, 3) identification of coding potential, and 4) a multiple domain architecture annotation that complements BLAST and reduces non-specific domain annotation. We demonstrate that, independent of the assembler, IFRAT selects bona fide transfrags (with CDS and coding potential) from the transcriptome assembly of a model organism without relying on post-assembly clustering or reassembly. The robustness of IFRAT is inferred from RNA-Seq data of Neurospora crassa assembled using de Bruijn graph-based assemblers in single (Trinity and Oases-25) and multiple (Oases-Merge and additive or pooled) k-mer modes. Single k-mer assemblies contained fewer transfrags than the multiple k-mer assemblies. However, Trinity identified a number of predicted coding sequences and gene loci comparable to that of the Oases pooled assembly. IFRAT selects bona fide transfrags representing over 94% of the cumulative BLAST-derived functional annotations of the unfiltered assemblies. Between 4% and 6% are lost when orphan transfrags are excluded, and this represents only a tiny fraction of the annotation derived from functional transference by sequence similarity.
The median length of bona fide transfrags ranged from 1.5 kb (Trinity) to 2 kb (Oases), which is consistent with the average coding sequence length in fungi. The fraction of transfrags that could be associated with gene ontology terms ranged from 33% to 50%, which is also high for domain-based annotation. We showed that unselected transfrags were mostly truncated and represent sequences from intronic, untranslated (5′ and 3′) regions and non-coding gene loci. IFRAT simplifies post-assembly processing, providing a reference transcriptome enriched with functionally relevant assembly-derived transcripts for non-model organisms.
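
As an illustration of the first step named in the abstract (removal of identical subsequences), the following is a minimal Python sketch. It is not the IFRAT implementation, which the abstract does not describe in detail. The function names, the decision to also check the reverse complement, and the example sequences are all assumptions made for illustration. The idea is simply to drop any transfrag whose sequence occurs verbatim inside a longer transfrag that has already been kept.

```python
from typing import Dict

# Hypothetical helper: reverse complement of a DNA string (A, C, G, T, N only).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def remove_contained(transfrags: Dict[str, str]) -> Dict[str, str]:
    """Drop transfrags that are exact subsequences of a kept transfrag.

    Assumption: a sequence also counts as contained when its reverse
    complement occurs in a kept sequence. Exact duplicates are collapsed
    to the copy whose name sorts first.
    """
    # Visit longest sequences first so containers are kept before
    # the shorter sequences they contain are examined.
    ordered = sorted(transfrags.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    kept: Dict[str, str] = {}
    for name, seq in ordered:
        seq = seq.upper()
        rc = reverse_complement(seq)
        if any(seq in longer or rc in longer for longer in kept.values()):
            continue  # redundant: fully contained in a kept transfrag
        kept[name] = seq
    return kept


if __name__ == "__main__":
    # Made-up toy sequences, not data from the study.
    example = {
        "t1": "ATGGCCAAGTTTGA",
        "t2": "GCCAAG",          # substring of t1
        "t3": "CTTGGC",          # reverse complement of t2
        "t4": "ATGCCC",          # not contained in t1
        "t5": "ATGGCCAAGTTTGA",  # exact duplicate of t1
    }
    print(sorted(remove_contained(example)))
```

This pairwise scan is quadratic in the number of sequences, so it only suits small inputs; a real assembly with many thousands of transfrags would need an indexed approach.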