
Genomic prediction using linkage disequilibrium and co-segregation.


Abstract

Genetic improvement for economically important traits in livestock populations has been revolutionized through the application of genomic selection, where the selection criterion for parents of future generations incorporates genomic estimated breeding values (GEBV). Genomic prediction is a statistical method that predicts GEBV based on high-density genotypes of single nucleotide polymorphisms (SNPs) with genome-wide coverage. The theoretical basis for genomic prediction is that the genetic variance of every quantitative trait locus (QTL) for a desired trait can be captured by SNPs due to linkage disequilibrium (LD) between QTL and SNPs. To date, most statistical models for genomic prediction are based on multiple regression of trait phenotypes on SNP genotypes. Informative prior distributions are usually specified for SNP effects, which allows simultaneous estimation of all SNP effects (training). Computer simulation of genomic prediction has revealed that the accuracy of GEBV depends on the genetic basis of the trait, the size of the training population, and LD between QTL and SNPs, which is affected by historical and current effective population sizes (Ne), mutation, selection, population stratification, family structure, SNP density, and minor allele frequencies (MAF) of QTL and SNPs. With moderate to high levels of LD, GEBV are expected to have significantly higher accuracy than breeding values estimated using pedigree relationships. In analyses of field datasets, higher accuracy is typically observed in populations that are closely related to the training population, whereas the accuracy in a distantly related population is often low or even zero. Further, prediction accuracy hardly improves when the density of SNPs, which are usually selected to have high MAF, is increased; this contradicts results from simulation studies.
Evidence has been increasing that LD between QTL and SNPs in livestock populations is low because many QTL have much lower MAF than SNPs, and that prediction accuracy mainly comes from co-segregation (CS) and additive relationships that are implicitly captured by SNP genotypes.

With low LD between QTL and SNPs, CS information is expected to capture QTL effects more accurately than LD information. CS refers to alleles at linked loci originating from the same parental chromosome, which is quantified by the identical grand-parental allele origins at linked loci. CS information is by definition independent of LD, but is affected by the distance between QTL and SNPs along the chromosome and by current Ne, which is usually determined by the mating design for a specific breeding program. The objectives of this thesis were to develop a statistical method to model CS explicitly, and to study the effects of historical LD, current Ne, MAF of QTL, and SNP density on the contributions of LD and CS information to prediction accuracy. The CS model was developed by following the transmission of QTL alleles using allele origins at SNPs. Simulated half-sib datasets were analyzed to study contributions of LD and CS information to prediction accuracy for datasets that included many unrelated families. Simulated datasets of extended pedigrees with different mating designs were analyzed to study contributions of LD and CS information to prediction accuracy across validation generations without retraining. Results from half-sib datasets showed that when LD between QTL and SNPs was low, the accuracy of the model that fits SNP genotypes (LD model) decreased when the training data size was increased by adding independent sire families, but accuracies from the CS model and a combined LD-CS model increased and plateaued rapidly as the number of sire families increased.
Results from half-sib datasets suggest that modeling CS explicitly improves prediction accuracy when LD between QTL and SNPs is low, especially when the training data size is increased by adding independent families. Results from extended pedigrees showed that the LD model resulted in high accuracy across validation generations only when LD between QTL and SNPs was high. With low LD between QTL and SNPs, modeling CS explicitly resulted in higher accuracy than the LD model across validation generations when the mating design generated a large number of close relatives. Results from extended pedigrees suggest that modeling both LD and CS explicitly is expected to improve prediction accuracy when current Ne is small and LD between QTL and SNPs is low due to distinct MAF, which is the typical situation in most livestock populations.

Application of the CS and the LD-CS models in field datasets has two major difficulties. First, obtaining allele origins for genome-wide SNPs can be computationally demanding. Second, the application of the CS model is limited to populations with correctly recorded pedigrees. CS information in populations without pedigree can be explicitly captured by fitting SNP haplotypes. The reason is that, as shown by our previous studies, the association between 1-cM haplotypes and QTL alleles is complete with a high SNP density of 200 SNPs/cM, and therefore 1-cM haplotypes can accurately follow the transmission of QTL alleles from the most recent common ancestor. Simulated datasets of extended pedigrees with different mating designs were analyzed to study contributions of fitting SNP genotypes and haplotypes to prediction accuracy across validation generations without retraining. Results showed that fitting both SNP genotypes and haplotypes had accuracy similar to fitting only SNP genotypes when LD between QTL and SNPs was high, but had significantly higher accuracy than fitting SNP genotypes when LD between QTL and SNPs was low.
In the analyses of several egg quality traits of commercial layer chickens, fitting both SNP genotypes and haplotypes improved prediction accuracy for traits for which the accuracy was almost zero when fitting only SNP genotypes. Fitting haplotypes is effective for capturing CS information for genomic prediction, especially when LD between QTL and SNPs is low and LD contributes little to prediction accuracy.

In conclusion, genomic prediction models that fit SNP genotypes capture both LD and CS information. When most QTL have much lower MAF than SNPs, LD between QTL and SNPs is low, and the accuracy obtained from fitting SNP genotypes is mainly contributed by CS information that is implicitly captured by SNP genotypes. This accuracy decreases when the training data size is increased by adding independent families, and deteriorates across validation generations without retraining, because CS information captured by SNP genotypes over long chromosome distances erodes rapidly by recombination. CS information can be explicitly captured by modeling transmission of putative QTL alleles within short chromosome regions using allele origins at SNPs. Modeling CS explicitly contributes little to accuracy when LD between QTL and SNPs is high, but contributes substantially when LD between QTL and SNPs is low. CS information contributes more to accuracy in populations with smaller current Ne, because fewer haplotypes segregate in a population with a smaller current Ne, and the effect of each haplotype can be estimated more accurately. Therefore, modeling CS explicitly is expected to result in high accuracy across validation generations in mating designs that create small current Ne. For populations without pedigree information, CS information can be modeled explicitly by fitting SNP haplotypes within short chromosome regions.
Fitting haplotypes captures as much CS information as modeling CS by following the transmission of QTL alleles of pedigree founders, but also captures CS information from most recent common ancestors. Although fitting both SNP genotypes and haplotypes improved accuracy for several traits in layer chickens for which the SNP model had low accuracy, the potential advantage of the SNP-haplotype model in improving accuracy for livestock populations requires further study.
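
The abstract attributes low LD between QTL and SNPs to QTL having much lower MAF than SNPs. The following is a minimal sketch of why a frequency mismatch alone limits LD, and it is not taken from the thesis: for two biallelic loci, the largest attainable r² depends only on the two allele frequencies. The function name `max_r2` and the example frequencies (0.05 for a QTL, 0.40 for a SNP) are illustrative assumptions.

```python
def max_r2(p_a: float, p_b: float) -> float:
    """Upper bound on r^2 between two biallelic loci.

    p_a and p_b are the frequencies of one allele at each locus
    (hypothetical inputs, e.g. QTL and SNP minor allele frequencies).
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    # Haplotype frequencies must stay non-negative, which bounds the
    # disequilibrium coefficient D on both sides.
    d_pos = min(p_a * q_b, q_a * p_b)   # largest positive D
    d_neg = min(p_a * p_b, q_a * q_b)   # largest |D| when D is negative
    d = max(d_pos, d_neg)
    # r^2 = D^2 / (p_a * q_a * p_b * q_b)
    return d * d / (p_a * q_a * p_b * q_b)


if __name__ == "__main__":
    # Rare QTL allele versus a common SNP allele: r^2 cannot exceed about 0.079.
    print(round(max_r2(0.05, 0.40), 3))
    # Matching frequencies allow complete LD.
    print(round(max_r2(0.30, 0.30), 3))
```

Under these assumed frequencies, even a SNP that is as tightly associated with the QTL as the allele frequencies permit explains under 8% of the QTL genotype variance, which is consistent with the abstract's point that adding more high-MAF SNPs hardly improves accuracy.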

Bibliographic record

  • Author: Sun, Xiaochen
  • Author affiliation: Iowa State University
  • Degree-granting institution: Iowa State University
  • Subjects: Genetics; Animal sciences; Statistics
  • Degree: Ph.D.
  • Year: 2014
  • Pages: 201 p.
  • Total pages: 201
  • Original format: PDF
  • Language of text: eng
