BMC Genomics

Generation of SNP datasets for orangutan population genomics using improved reduced-representation sequencing and direct comparisons of SNP calling algorithms


Abstract

Background: High-throughput sequencing has opened up exciting possibilities in population and conservation genetics by enabling the assessment of genetic variation at genome-wide scales. One approach to reducing genome complexity, i.e. investigating only parts of the genome, is reduced-representation library (RRL) sequencing. Like similar approaches, RRL sequencing reduces ascertainment bias because single-nucleotide polymorphisms (SNPs) are discovered and genotyped simultaneously, and it does not require a reference genome. Yet generating such datasets remains challenging because of laboratory and bioinformatic issues. In the laboratory, current protocols need to improve the sequencing of homologous fragments in order to reduce the number of missing genotypes. On the bioinformatic side, the reliance of most studies on a single SNP caller disregards the possibility that different algorithms may produce disparate SNP datasets.

Results: We present an improved RRL (iRRL) protocol that maximizes the generation of homologous DNA sequences, thus achieving improved genotyping-by-sequencing efficiency. Our modifications facilitate the generation of single-sample libraries, enabling individual genotype assignments instead of pooled-sample analysis. We sequenced ~1% of the orangutan genome with 41-fold median coverage in 31 wild-born individuals from two populations. SNPs and genotypes were called using three different algorithms. We obtained substantially different SNP datasets depending on the SNP caller. Genotype validations revealed that the Unified Genotyper of the Genome Analysis Toolkit and SAMtools performed significantly better than a caller from CLC Genomics Workbench (CLC). Of all conflicting genotype calls, CLC was correct in only 17% of cases. Furthermore, conflicting genotypes between two algorithms showed a systematic bias: one caller almost exclusively assigned heterozygotes, while the other almost exclusively assigned homozygotes.

Conclusions: Our enhanced iRRL approach greatly facilitates genotyping-by-sequencing and thus direct estimates of allele frequencies. Our direct comparison of three commonly used SNP callers emphasizes the need to question the accuracy of SNP and genotype calling, as we obtained considerably different SNP datasets depending on caller algorithm, sequencing depth and filtering criteria. These differences affected scans for signatures of natural selection and will also exert undue influence on demographic inferences. This study presents the first effort to generate a population genomic dataset for wild-born orangutans with known population provenance.
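
The kind of caller-versus-caller comparison described in the Results can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `(sample, site)` keys, the `A/G` genotype encoding and the toy data are all assumptions made for illustration. It tallies concordant genotypes between two callers and splits the conflicts by which caller assigned the heterozygote.

```python
from collections import Counter


def is_het(genotype):
    """Return True if a diploid genotype such as 'A/G' has two different alleles."""
    first, second = genotype.split("/")
    return first != second


def compare_calls(calls_a, calls_b):
    """Tally agreement between two callers at positions genotyped by both.

    calls_a, calls_b: dicts mapping (sample, site) -> genotype string, e.g. 'A/G'.
    Missing genotypes are simply absent from the dict (hypothetical encoding).
    """
    tally = Counter()
    for key in calls_a.keys() & calls_b.keys():
        gt_a, gt_b = calls_a[key], calls_b[key]
        # Allele order is irrelevant for unphased calls: 'A/G' equals 'G/A'.
        if sorted(gt_a.split("/")) == sorted(gt_b.split("/")):
            tally["concordant"] += 1
        elif is_het(gt_a) and not is_het(gt_b):
            tally["a_het_b_hom"] += 1
        elif not is_het(gt_a) and is_het(gt_b):
            tally["a_hom_b_het"] += 1
        else:
            tally["other_conflict"] += 1
    return tally


# Toy data, invented purely for illustration.
caller_a = {("s1", 100): "A/G", ("s1", 200): "C/C",
            ("s2", 100): "A/G", ("s2", 300): "T/T"}
caller_b = {("s1", 100): "A/A", ("s1", 200): "C/C",
            ("s2", 100): "G/A", ("s2", 300): "T/C"}

result = compare_calls(caller_a, caller_b)
print(dict(result))
```

A strong imbalance between the two heterozygote/homozygote conflict counts would correspond to the systematic bias reported above; deciding which caller is correct still requires independent genotype validation.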
