Journal of Computational Biology

Normal and Compound Poisson Approximations for Pattern Occurrences in NGS Reads



Abstract

Next-generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Because many organisms have unknown genome sequences, and many reads cannot be uniquely mapped even when the genome sequence is known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and for the process of sampling sequence reads from the genome. Based on these models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns in the sequence reads, with bounds on the approximation error. The main challenge is to account for two sources of randomness: the generation of the long background sequence and the sampling of the reads by NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns in NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data for the transcription factor GABP. Software is available online, and additional material can also be found online.
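The two sources of randomness named in the abstract (the background sequence and the read sampling) can be illustrated with a small Monte Carlo sketch. This is not the authors' software or their approximation algorithm. It is only a sanity check under simplifying assumptions that are ours, not the article's: an i.i.d. uniform background over A, C, G, T, reads of fixed length drawn uniformly from a single strand, and overlapping occurrences counted within each read. All function names and parameter values are hypothetical. Under these assumptions the expected count follows from linearity of expectation, and the simulated mean should land near it.

```python
import random


def count_overlapping(text, word):
    """Count occurrences of `word` in `text`, allowing overlaps."""
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def simulate_read_count(word, genome_len, n_reads, read_len, rng):
    """One replicate: draw a random genome, sample reads, count the word.

    Assumptions (ours): i.i.d. uniform letters, uniform read starts,
    single strand only.
    """
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    total = 0
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        total += count_overlapping(genome[start:start + read_len], word)
    return total


def expected_count(word, n_reads, read_len):
    """Exact mean under the i.i.d. uniform model: each of the
    (read_len - k + 1) windows in each read matches with probability
    0.25 ** k, where k is the word length."""
    k = len(word)
    return n_reads * (read_len - k + 1) * 0.25 ** k


if __name__ == "__main__":
    rng = random.Random(0)
    counts = [simulate_read_count("ACGT", 2000, 100, 35, rng) for _ in range(200)]
    mean = sum(counts) / len(counts)
    print("simulated mean:", mean)
    print("expected mean: ", expected_count("ACGT", 100, 35))
```

A histogram of `counts` from such a simulation is the kind of empirical distribution that a normal or compound Poisson approximation would be compared against. The variance depends on read overlap and on the word's self-overlap structure, which this sketch does not model analytically.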
