IEEE International Conference on Bioinformatics and Biomedicine

Multi-purpose SNP Selection by the principal variables for a genetic study

Abstract

In genome-wide association studies, the number of single nucleotide polymorphisms (SNPs) has increased drastically. The data may contain many near-duplicate SNPs in linkage disequilibrium, which complicates analysis and can cause statistical problems downstream. Principal component analysis (PCA) is a popular dimension reduction technique and is well known to be effective for many genetic association analyses. However, each principal component is a linear combination of all the original variables, so it offers no direct interpretation in terms of the original variables. The purpose of our study is to eliminate redundant SNPs and select a smaller subset consisting only of informative SNPs. We propose an unsupervised SNP selection algorithm based on the principal variable (PV) method. It achieves dimensionality reduction by selecting a subset of the original variables, called PVs, that preserves as much information as possible. To find an optimal subset of SNPs, we focus on the criterion that minimizes the squared norm of the partial covariance matrix. We define principal component (PC) clusters by PCA and choose as the representative SNP the one with high loadings, on average, on the important principal components. After discarding the other SNPs in the PC cluster, we calculate the partial covariance matrix of the remaining variables given the principal variable. To obtain the next representative SNP, the same procedure is applied to the partial covariance matrix. The process repeats until no variables remain to be selected or a stopping criterion is met, namely the percentage of variance explained in terms of the trace or squared norm of the covariance matrix. The resulting subset of SNPs can be used for further analysis with multiple purposes, such as gene-gene interaction analysis. We illustrate the proposed method on real genotype data and compare its performance with five current selection methods for principal variables.
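
The iteration described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it omits the PC-cluster step and instead uses a plain greedy search that, at each round, picks the variable whose conditioning minimizes the squared Frobenius norm of the partial covariance matrix, then stops once a chosen fraction of the total variance (trace) is explained. The function names, the `explained` threshold, and the `tol` cutoff are assumptions made for illustration. For a single conditioning variable j, the partial covariance is S - s_j s_j^T / S_jj, where s_j is column j of S.

```python
import numpy as np

def partial_cov(S, j):
    """Partial covariance of all variables given variable j.

    Row and column j of the result are (numerically) zero.
    """
    s = S[:, [j]]
    return S - (s @ s.T) / S[j, j]

def select_principal_variables(X, explained=0.9, tol=1e-10):
    """Greedy principal-variable selection (illustrative sketch only).

    X: (n_samples, n_snps) array, e.g. genotypes coded 0/1/2.
    explained: stop once this fraction of the total variance is explained.
    tol: variables with residual variance <= tol are treated as exhausted.
    Returns the column indices of the selected variables, in selection order.
    """
    S = np.atleast_2d(np.cov(X, rowvar=False))
    total = np.trace(S)
    if total <= 0:
        return []
    R = S.copy()          # residual (partial) covariance given the selected set
    selected = []
    p = S.shape[0]
    while len(selected) < p and 1.0 - np.trace(R) / total < explained:
        best_j, best_norm = None, np.inf
        for j in range(p):
            if j in selected or R[j, j] <= tol:
                continue
            # Squared Frobenius norm of the partial covariance given j.
            norm = np.sum(partial_cov(R, j) ** 2)
            if norm < best_norm:
                best_j, best_norm = j, norm
        if best_j is None:  # nothing left that carries variance
            break
        selected.append(best_j)
        R = partial_cov(R, best_j)
    return selected

if __name__ == "__main__":
    # Simulated genotypes (hypothetical data, not from the paper).
    rng = np.random.default_rng(0)
    base = rng.integers(0, 3, size=(200, 3)).astype(float)
    X = np.hstack([base, base[:, [0]]])  # column 3 duplicates column 0
    print(select_principal_variables(X, explained=0.95))
```

Because conditioning is applied to the residual matrix at each round, the residual after k rounds is the partial covariance given all k selected variables. An exactly duplicated SNP ends up with zero residual variance once its twin is selected, so it is never chosen.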
