Journal: BMC Bioinformatics

Improving RNA-Seq expression estimation by modeling isoform- and exon-specific read sequencing rate


Abstract

Background: RNA-Seq, a high-throughput sequencing technology, has been widely used in recent years to quantify gene and isoform expression in transcriptome studies. Accurate expression measurement from the millions or billions of short reads generated is obstructed by two difficulties. The first is the ambiguous mapping of reads to the reference transcriptome caused by alternative splicing, which increases the uncertainty in estimating isoform expression. The second is the non-uniformity of the read distribution along the reference transcriptome, due to positional, sequencing, mappability and other undiscovered sources of bias. This violates the assumption of uniformly distributed reads made by many expression calculation approaches, such as the direct RPKM calculation and Poisson-based models. Many methods have been proposed to address these difficulties, and some employ latent variable models to discover the underlying pattern of read sequencing. However, most of these methods correct bias based on the surrounding sequence content and share a single bias model across all genes. They therefore cannot estimate the gene- and isoform-specific biases revealed by recent studies.

Results: We propose a latent variable model, NLDMseq, to estimate gene and isoform expression. Our method uses latent variables to model the unknown isoforms from which reads originate and the underlying proportions of multiple spliced variants. Isoform- and exon-specific read sequencing biases are modeled to account for the non-uniformity of the read distribution, and are identified by using the replicate information from multiple lanes of a single library run. We use simulated and real data to assess the accuracy of our method in calculating gene and isoform expression. The results show that NLDMseq obtains gene and isoform expression estimates that are competitive with those of popular alternatives. Finally, the proposed method is applied to the detection of differential expression (DE) to show its usefulness in downstream analysis.

Conclusions: The proposed NLDMseq method provides an approach to accurately estimate gene and isoform expression from RNA-Seq data by modeling isoform- and exon-specific read sequencing biases. It uses a latent variable model to discover the hidden pattern of read sequencing. We have shown that it works well on both simulated and real datasets, and that its performance is competitive with popular methods. The method has been implemented as freely available software, which can be found at https://github.com/PUGEA/NLDMseq .
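For readers unfamiliar with the "direct RPKM calculation" mentioned in the Background, the following is a minimal sketch of the standard RPKM formula (reads per kilobase of transcript per million mapped reads). It is not part of NLDMseq; the function name, parameter names and the numbers in the example are hypothetical and chosen only for illustration. The formula treats every position of a transcript as equally likely to produce a read, which is the uniformity assumption the abstract says is violated in practice.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of transcript per million mapped reads.

    Assumes reads are spread uniformly along the transcript, so the count
    is simply normalised by transcript length and sequencing depth.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("transcript length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical values for illustration only (not taken from the paper)
value = rpkm(read_count=500, transcript_length_bp=2000, total_mapped_reads=10_000_000)
print(value)  # 25.0
```

Because the count is divided by the full transcript length, any positional or sequence-specific bias in where reads land is folded directly into the estimate, which is the motivation for bias-aware models such as the one the abstract describes.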
