Journal: Genome Research

A probabilistic approach for SNP discovery in high-throughput human resequencing data.

Abstract

New high-throughput sequencing technologies are generating large amounts of sequence data, allowing the development of targeted large-scale resequencing studies. For these studies, accurate identification of polymorphic sites is crucial. Heterozygous sites are particularly difficult to identify, especially in regions of low coverage. We present a new strategy for identifying heterozygous sites in a single individual by using a machine learning approach that generates a heterozygosity score for each chromosomal position. Our approach also facilitates the identification of regions with unequal representation of the two alleles and of other poorly sequenced regions. The availability of confidence scores allows for a principled combination of sequencing results from multiple samples. We evaluate our method on a gold-standard genotype data set from HapMap. We are able to classify sites in this data set as heterozygous or homozygous with 98.5% accuracy. In de novo data, our probabilistic heterozygote detection ("ProbHD") is able to identify 93% of heterozygous sites at a <5% false call rate (FCR), as estimated from independent genotyping results. In a direct comparison of ProbHD with high-coverage 1000 Genomes sequencing available for a subset of our data, we observe >99.9% overall agreement for genotype calls and close to 90% agreement for heterozygote calls. Overall, our data indicate that high-throughput resequencing of human genomic regions requires careful attention to systematic biases in sample preparation as well as in sequence contexts, and that their impact can be alleviated by machine learning-based sequence analyses that allow more accurate extraction of true DNA variants.
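
To make the idea of a per-position heterozygosity score concrete, the following is a minimal sketch of a textbook binomial genotype model. It is not the ProbHD method, whose features and classifier are not described in the abstract. The function name `het_posterior`, the per-read error rate, and the genotype priors are all illustrative assumptions.

```python
from math import comb


def het_posterior(ref_reads, alt_reads, error_rate=0.01, het_prior=0.001):
    """Posterior probability that a site is heterozygous.

    Illustrative binomial model only (NOT ProbHD): each read shows the
    alternate allele with probability error_rate (hom-ref), 0.5 (het),
    or 1 - error_rate (hom-alt). Priors are assumed values.
    """
    n = ref_reads + alt_reads
    # Probability that a single read carries the alternate allele.
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1.0 - error_rate}
    # Assumed genotype priors; they sum to 1.
    prior = {
        "hom_ref": 1.0 - 1.5 * het_prior,
        "het": het_prior,
        "hom_alt": 0.5 * het_prior,
    }
    # Unnormalized posterior weight: prior times binomial likelihood.
    weight = {
        g: prior[g] * comb(n, alt_reads) * p**alt_reads * (1.0 - p) ** ref_reads
        for g, p in p_alt.items()
    }
    return weight["het"] / sum(weight.values())


if __name__ == "__main__":
    # Balanced alleles at moderate coverage versus a low-coverage site.
    for ref, alt in [(10, 10), (20, 0), (2, 1)]:
        print(ref, alt, het_posterior(ref, alt))
```

Under these assumed parameters, a site with 10 reference and 10 alternate reads scores close to 1, while a site with only 2 reference reads and 1 alternate read scores low, because a single alternate read is also plausible as a sequencing error. This is one way to see why low-coverage regions are hard to call.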