IEEE International Conference on Bioinformatics and Biomedicine

Comparing Four Genome-Wide Association Study (GWAS) Programs with Varied Input Data Quantity

Abstract

Genome-wide association studies (GWAS) have been a primary method over the past decade for identifying associations between genetic variants and traits or diseases. Many software packages have been developed for GWAS analysis based on different statistical models. One key factor influencing the statistical reliability of GWAS is the amount of input data used. Few studies have investigated this effect by comparing the performance of GWAS programs on varied amounts of experimental data, especially in the context of plants and plant genomes. In this paper, we investigate how input data quantity influences the output of four widely used GWAS programs: PLINK, TASSEL, GAPIT, and FaST-LMM. Both synthetic and real data are used. Standard GWAS output includes single nucleotide polymorphisms (SNPs) and their p-values. The programs are evaluated using the p-values and q-values of SNPs and the Kendall rank correlation between output SNP lists. Results show that, with the same GWAS program, different Arabidopsis thaliana datasets demonstrate similar trends of rank correlation with varied input quantity, but differ in the number of SNPs passing a given p- or q-value threshold. In practice, experimental datasets may have samples containing varied numbers of biological replicates. We show that this variation in replicates influences the p-values of SNPs, but does not strongly affect the rank correlation. When comparing synthetic and real data, the output SNPs from synthetic data have similar rank correlation trends across all four GWAS programs, but the same measure from real data varies across the programs. In addition, the real data yield an approximately linear increase in the number of significant SNPs with more input data, but the synthetic data do not follow this trend. This study provides guidance on selecting GWAS programs when varied experimental data is present and on selecting significant SNPs for subsequent study. It contributes to understanding how much input data is necessary to yield satisfactory GWAS results.
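
The evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes each GWAS run has been reduced to a mapping from SNP identifier to p-value, uses the tau-b variant of the Kendall rank correlation, and approximates q-values with Benjamini-Hochberg adjusted p-values. The abstract does not state which variants the authors used, and every function name and SNP identifier below is hypothetical.

```python
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length score lists (simple O(n^2) version)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    denom = (untied_in_x * untied_in_y) ** 0.5
    return (concordant - discordant) / denom if denom else float("nan")


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, used here as a stand-in for q-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


def compare_runs(pvals_a, pvals_b):
    """Rank correlation of two runs over the SNPs that both runs report."""
    shared = sorted(set(pvals_a) & set(pvals_b))
    return kendall_tau_b([pvals_a[s] for s in shared],
                         [pvals_b[s] for s in shared])


def count_significant(pvalues, threshold, use_q=False):
    """Number of SNPs whose p-value (or adjusted value) falls below a threshold."""
    values = bh_qvalues(pvalues) if use_q else pvalues
    return sum(v < threshold for v in values)


if __name__ == "__main__":
    # Hypothetical p-values from one program run on a subset and on the full data.
    subset_run = {"snp1": 0.20, "snp2": 0.001, "snp3": 0.04, "snp4": 0.60}
    full_run = {"snp1": 0.05, "snp2": 0.0001, "snp3": 0.01, "snp4": 0.70}
    print("tau-b:", compare_runs(subset_run, full_run))
    print("q < 0.05:", count_significant(list(full_run.values()), 0.05, use_q=True))
```

The pairwise loop is quadratic in the number of SNPs, so it only suits small illustrations; for genome-scale SNP lists a library routine such as `scipy.stats.kendalltau` would be the more practical choice.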