BMC Bioinformatics

ViVaMBC: estimating viral sequence variation in complex populations from Illumina deep-sequencing data using model-based clustering

Abstract

Deep sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology-associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores, which for Illumina sequencing are derived from a quadruplet of intensities, one channel for each nucleotide type. The highest intensity of the four channels determines the base that is called. Mismatched bases can often be corrected by the second best base, i.e. the base with the second highest intensity in the quadruplet.

A virus variant model-based clustering method, ViVaMBC, is presented that uses quality scores and second best base calls to identify and quantify viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets), which enables immediate biological interpretation of the variants with respect to their antiviral drug responses. Using mixtures of HCV plasmids, we show that our method accurately estimates frequencies down to 0.5%. The estimates are unbiased when average coverages of 25,000 are reached. A comparison with the SNP callers V-Phaser2, ShoRAH, and LoFreq shows that ViVaMBC has superb sensitivity and specificity for variants with frequencies above 0.4%. Unlike its competitors, ViVaMBC reports a higher number of false-positive findings at frequencies below 0.4%, which might partially originate from picking up artificial variants introduced by errors in the sample and library preparation steps.

ViVaMBC is the first method to call viral variants directly at the codon level. The strength of the approach lies in modeling the error probabilities based on the quality scores. Although the use of second best base calls appeared very promising in our data exploration phase, their utility was limited. They provided a slight increase in sensitivity, which, however, does not warrant the additional computational cost of running the offline base caller. Apparently, a lot of information is already contained in the quality scores, enabling the model-based clustering procedure to adjust for the majority of the sequencing errors. Overall, the sensitivity of ViVaMBC is such that technical constraints like PCR errors start to form the bottleneck for low-frequency variant detection.
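
To make the link between quality scores and error probabilities concrete, the following Python sketch converts Phred quality scores into per-base error probabilities and combines them into a likelihood for an observed codon. The Phred relation Q = -10 log10(p) is the standard definition. Everything else is an assumption made for illustration and is not taken from the ViVaMBC paper: the function names `phred_to_error_prob` and `codon_likelihood`, the independence of errors across the three positions, and the uniform split of the error probability over the three alternative bases.

```python
from math import prod
from typing import Sequence


def phred_to_error_prob(q: float) -> float:
    """Standard Phred definition: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def codon_likelihood(observed: str, candidate: str, quals: Sequence[float]) -> float:
    """P(observed codon | candidate codon) under an illustrative error model.

    Assumptions (hypothetical, not ViVaMBC's actual model):
    - errors at the three codon positions are independent;
    - a miscall is equally likely to be any of the 3 other bases.
    """
    if not (len(observed) == len(candidate) == len(quals) == 3):
        raise ValueError("a codon needs exactly 3 bases and 3 quality scores")
    terms = []
    for obs, cand, q in zip(observed, candidate, quals):
        p = phred_to_error_prob(q)
        # Matching base: called correctly. Mismatch: one of 3 possible miscalls.
        terms.append(1 - p if obs == cand else p / 3)
    return prod(terms)
```

Under these assumptions, a read codon with high quality scores contributes almost all of its weight to the matching candidate codon, while a low-quality mismatch still leaves noticeable weight on the alternative. This is the intuition behind letting quality scores drive a clustering of reads into variants.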
