Model-based quality assessment and base-calling for second-generation sequencing data.

Abstract

Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, making it capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1200 people. Comparative analysis at the sequence level of a large number of samples across multiple populations may become possible within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads (strings of A, C, G, and T between 30 and 100 characters long), which are the result of base-calling: the complex processing of noisy continuous fluorescence intensity measurements. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single-nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is of utmost importance. In this article, we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates, readily usable in quality assessment tools, while significantly improving base-calling performance.