IEEE International Conference on Bioinformatics and Biomedicine

Comparing Four Genome-Wide Association Study (GWAS) Programs with Varied Input Data Quantity

Abstract

Genome-wide association studies (GWAS) have been a primary method over the past decade for identifying associations between genetic variants and traits or diseases. Many software packages have been developed for GWAS analysis based on different statistical models. One key factor influencing the statistical reliability of GWAS is the amount of input data used. Few studies have investigated this effect by comparing the performance of GWAS programs on varied amounts of experimental data, especially in the context of plants and plant genomes. In this paper, we investigate how input data quantity influences the output of four widely used GWAS programs: PLINK, TASSEL, GAPIT, and FaST-LMM. Both synthetic and real data are used. Standard GWAS output includes single nucleotide polymorphisms (SNPs) and their p-values. To evaluate the programs, we use the p-values and q-values of SNPs and the Kendall rank correlation between output SNP lists. Results show that, with the same GWAS program, different Arabidopsis thaliana datasets demonstrate similar trends in rank correlation as input quantity varies, but differ in the number of SNPs passing a given p- or q-value threshold. In practice, experimental datasets may have samples containing varied numbers of biological replicates. We show that this variation in replicates influences the p-values of SNPs but does not strongly affect the rank correlation. When comparing synthetic and real data, the output SNPs from synthetic data have similar rank correlation trends across all four GWAS programs, whereas the same measure from real data varies across the programs. In addition, the real data yield a roughly linear increase in the number of significant SNPs as input data increases, but the synthetic data do not follow this trend. This study provides guidance on selecting GWAS programs when varied experimental data are present and on selecting significant SNPs for subsequent study. It contributes to understanding how much input data is necessary to yield satisfactory GWAS results.
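
The two evaluation measures named in the abstract can be illustrated with a small sketch. It is not the authors' code. The SNP names and p-values are made up. The q-values are computed with the Benjamini-Hochberg adjustment, which is an assumption, since the abstract does not say which q-value method was used. The rank correlation is the simple tau-a form of Kendall's coefficient, which suits data without tied p-values.

```python
from itertools import combinations


def kendall_tau_a(p_a, p_b):
    """Kendall tau-a between two {snp: p-value} maps, over the SNPs they share."""
    snps = sorted(set(p_a) & set(p_b))
    concordant = discordant = 0
    for s, t in combinations(snps, 2):
        # Positive product: the pair is ordered the same way in both outputs.
        product = (p_a[s] - p_a[t]) * (p_b[s] - p_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(snps) * (len(snps) - 1) // 2
    return (concordant - discordant) / n_pairs


def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values (an assumed q-value method)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical output of one GWAS program on the full data and on a subset.
full_run = {"snp1": 1e-6, "snp2": 1e-4, "snp3": 0.02, "snp4": 0.3, "snp5": 0.8}
subset_run = {"snp1": 1e-3, "snp2": 5e-4, "snp3": 0.04, "snp4": 0.5, "snp5": 0.6}

tau = kendall_tau_a(full_run, subset_run)
q_full = bh_qvalues(list(full_run.values()))
n_significant = sum(q <= 0.05 for q in q_full)
print(tau, n_significant)
```

In this toy data, only snp1 and snp2 swap order between the two runs, so the rank correlation stays high even though the individual p-values differ considerably. That is the kind of contrast between the rank measure and the threshold counts that the abstract describes.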
