In genomic prediction, common analysis methods rely on a linear mixed-model framework to estimate SNP marker effects and breeding values of animals or plants. Ridge regression–best linear unbiased prediction (RR-BLUP) is based on the assumptions that SNP marker effects are normally distributed, are uncorrelated, and have equal variances. We propose DAIRRy-BLUP, a parallel, Distributed-memory RR-BLUP implementation, based on single-trait observations (y), that uses the Average Information algorithm for restricted maximum-likelihood estimation of the variance components. The goal of DAIRRy-BLUP is to enable the analysis of large-scale data sets to provide more accurate estimates of marker effects and breeding values. A distributed-memory framework is required since the dimensionality of the problem, determined by the number of SNP markers, can become too large to be analyzed by a single computing node. Initial results show that DAIRRy-BLUP enables the analysis of very large-scale data sets (up to 1,000,000 individuals and 360,000 SNPs) and indicate that increasing the number of phenotypic and genotypic records has a more significant effect on the prediction accuracy than increasing the density of SNP arrays.
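
To make the RR-BLUP model concrete, the following is a minimal single-node sketch in NumPy, not the distributed DAIRRy-BLUP implementation. It assumes the model y = 1μ + Zu + e with u ~ N(0, σ_u² I) and e ~ N(0, σ_e² I), and it treats the variance ratio λ = σ_e²/σ_u² as known rather than estimating it with Average Information REML. The function name `rr_blup`, the simulated genotypes, and all parameter values are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def rr_blup(y, Z, lam):
    """Solve the RR-BLUP mixed-model equations for a single trait.

    y   : (n,) phenotype vector
    Z   : (n, m) SNP genotype matrix (e.g. coded 0/1/2)
    lam : assumed-known variance ratio sigma_e^2 / sigma_u^2
    Returns the estimated overall mean and the m marker effects.
    """
    n, m = Z.shape
    X = np.ones((n, 1))  # fixed effect: overall mean only (an assumption)

    # Coefficient matrix of the mixed-model equations; the ridge term
    # lam * I encodes equal-variance, uncorrelated marker effects.
    C = np.block([
        [X.T @ X, X.T @ Z],
        [Z.T @ X, Z.T @ Z + lam * np.eye(m)],
    ])
    rhs = np.concatenate([X.T @ y, Z.T @ y])

    sol = np.linalg.solve(C, rhs)
    return sol[0], sol[1:]


# Small simulated data set (hypothetical sizes, far below the scale in the text).
rng = np.random.default_rng(0)
n, m = 200, 50
Z = rng.integers(0, 3, size=(n, m)).astype(float)
u_true = rng.normal(0.0, 0.1, size=m)
y = 2.0 + Z @ u_true + rng.normal(0.0, 1.0, size=n)

lam = 1.0 / 0.1**2  # sigma_e^2 / sigma_u^2 used in the simulation
mu_hat, u_hat = rr_blup(y, Z, lam)

# Genomic estimated breeding values are the summed marker effects per individual.
gebv = Z @ u_hat
```

The dense (m+1) × (m+1) system is what grows with the number of SNP markers; at the scale quoted above it no longer fits in the memory of one node, which is the motivation the text gives for a distributed-memory solver.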