Journal: BMC Proceedings

A ν-support vector regression based approach for predicting imputation quality


Abstract

Background: Decades of genome-wide association studies (GWAS) have accumulated large volumes of genomic data that could be reused to increase the statistical power of new studies. However, as biotechnology has evolved, different genotyping platforms with different marker sets have been used, which prevents old and new data from being pooled and compared. For example, to pool data collected with 550K chips together with newer data collected with 900K chips, the missing loci must be imputed. Many imputation algorithms have been developed, but the posterior probabilities they estimate are not a reliable measure of imputation quality. Recently, many studies have used an imputation quality score (IQS) to measure the quality of imputation. Computing the IQS requires knowing the true alleles. When the true alleles are unknown, a previously estimated IQS can be reused only if the population and the imputed loci are identical.

Methods: Here, we present a regression model for estimating IQS that learns from the imputation of loci with known alleles. We designed a small set of features for predicting IQS, such as minor allele frequencies and the distance to the nearest known cross-over hotspot. We evaluated our regression models by estimating the IQS of imputations produced by BEAGLE for a set of GWAS data from the NCBI GEO database, collected from samples from different ethnic populations.

Results: We construct a ν-SVR-based approach as our regression model. Our evaluation shows that this regression model can achieve mean squared errors of less than 0.02 and a correlation coefficient close to 0.75 in different imputation scenarios. We also show how the regression results can help remove false positives in association studies.

Conclusion: Reliable estimation of IQS will facilitate the integration and reuse of existing genomic data for meta-analysis and secondary analysis. Experiments show that it is possible to regress the IQS from a small number of features by learning from different training examples of imputation and IQS pairs.
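To make concrete why the IQS needs the true alleles, the following is a minimal Python sketch and not the authors' implementation. It assumes the commonly used kappa-style definition, IQS = (Po - Pc) / (1 - Pc), where Po is the observed concordance between true and imputed genotypes and Pc is the concordance expected by chance. It also assumes hard genotype calls rather than posterior probabilities, which is a simplification. The function name and the genotype labels are hypothetical.

```python
from collections import Counter


def imputation_quality_score(true_genotypes, imputed_genotypes):
    """Chance-corrected concordance between true and imputed genotype calls.

    Assumption: IQS = (Po - Pc) / (1 - Pc), computed from hard calls.
    """
    if len(true_genotypes) != len(imputed_genotypes) or not true_genotypes:
        raise ValueError("need two non-empty sequences of equal length")

    n = len(true_genotypes)

    # Po: fraction of samples whose imputed genotype matches the true one.
    observed = sum(
        t == i for t, i in zip(true_genotypes, imputed_genotypes)
    ) / n

    # Pc: agreement expected by chance, from the marginal genotype counts.
    true_counts = Counter(true_genotypes)
    imputed_counts = Counter(imputed_genotypes)
    chance = sum(
        true_counts[g] * imputed_counts[g] for g in true_counts
    ) / (n * n)

    if chance == 1.0:
        # Both sequences contain a single identical genotype: undefined.
        return float("nan")
    return (observed - chance) / (1.0 - chance)


# Hypothetical genotype calls for one locus across four samples.
true_calls = ["AA", "AA", "AB", "BB"]
imputed_calls = ["AA", "AB", "AB", "BB"]
score = imputation_quality_score(true_calls, imputed_calls)
```

Because `true_genotypes` is a required input, a score of this kind cannot be computed for loci whose alleles were never genotyped, which is the gap the regression model described in the abstract is meant to fill.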
