
The information bottleneck method for genome-wide association studies.


Abstract

In population studies, most current methods focus on identifying one outcome-related SNP at a time by testing for differences in genotype frequencies between disease and healthy groups or among different population groups. However, testing a large number of SNPs simultaneously raises the multiple-testing problem and can produce false-positive results. Although this problem can be dealt with effectively through approaches such as Bonferroni correction, permutation testing, and false discovery rates, patterns of joint effects from several genes, each with a weak effect, may remain undetected. With the availability of high-throughput genotyping technology, searching for multiple scattered SNPs over the whole genome and modeling their joint effect on the target variable has become possible. An exhaustive search of all SNP subsets is computationally infeasible for the millions of SNPs in a genome-wide study. Several effective feature selection methods combined with classification functions have been proposed to search for an optimal SNP subset in large data sets where the number of feature SNPs far exceeds the number of observations.

In this study, we take two steps to achieve this goal. First, we selected 1000 SNPs through an effective filter method; then we performed feature selection wrapped around a classifier to identify an optimal SNP subset for predicting disease. We also developed a novel classification method, the sequential information bottleneck method, wrapped inside different search algorithms to identify an optimal subset of SNPs for classifying the outcome variable. This new method was compared with classical linear discriminant analysis in terms of classification performance. Finally, we performed chi-square tests to examine the relationship between each SNP and disease from another point of view.

In general, our results show that filtering features using the harmonic mean of sensitivity and specificity (HMSS) through linear discriminant analysis (LDA) is better than using LDA training accuracy or mutual information in our study. Our results also demonstrate that an exhaustive search over small subsets of one, two, or three SNPs, based on the best 100 composite 2-SNPs, can find an optimal subset, and that including more SNPs through a heuristic algorithm does not always improve the performance of the SNP subset. Although sequential forward floating selection can be applied to avoid the nesting effect of forward selection, it does not always outperform forward selection, because observing more complex subset states leads to overfitting.

Our results also indicate that HMSS, as a criterion for evaluating the classification ability of a function, can be used on imbalanced data without modifying the original dataset, unlike classification accuracy. Our four studies suggest that the Sequential Information Bottleneck (sIB), a new unsupervised technique, can be adopted to predict the outcome, and that its ability to detect the target status is superior to that of traditional LDA in this study.

From our results, the best test probability-HMSS for predicting CVD, stroke, CAD, and psoriasis through sIB is 0.59406, 0.641815, 0.645315, and 0.678658, respectively. In terms of group prediction accuracy, the highest test accuracy of sIB for diagnosing a normal status among controls reaches 0.708999, 0.863216, 0.639918, and 0.850275, respectively, in the four studies if the test accuracy among cases is required to be at least 0.4. Conversely, the highest test accuracy of sIB for diagnosing disease among cases reaches 0.748644, 0.789916, 0.705701, and 0.749436, respectively, in the four studies if the test accuracy among controls is required to be at least 0.4.

A further genome-wide association study using the chi-square test detects no significant SNPs at the cut-off level of 9.09451E-08 in the Framingham Heart Study of CVD. The WTCCC study results detect only two significant SNPs associated with CAD. In the genome-wide study of psoriasis, most of the top 20 SNP markers with impressive classification accuracy are also significantly associated with the disease by chi-square test at the cut-off value of 1.11E-07.

Although our classification methods can achieve high accuracy in this study, a complete description of the classification results (95% confidence intervals or statistical tests of differences) requires more cost-effective methods or a more efficient computing system, neither of which is currently available for our genome-wide study. We should also note that the purpose of this study is to identify subsets of SNPs with high predictive ability, and that SNPs with good discriminant power are not necessarily causal markers for the disease.
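
The abstract's point that HMSS can be used on imbalanced data where classification accuracy is misleading can be illustrated with a small example. The following is a minimal Python sketch, not code from the dissertation: the function names `hmss` and `accuracy` and the toy labels (10 cases, 90 controls) are assumptions made for illustration, and HMSS is computed as the ordinary harmonic mean 2 * sensitivity * specificity / (sensitivity + specificity).

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hmss(y_true, y_pred, positive=1):
    """Harmonic mean of sensitivity and specificity for binary labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)


# Hypothetical imbalanced data: 10 cases (1) followed by 90 controls (0).
y_true = [1] * 10 + [0] * 90

# Classifier A labels everyone as a control.
pred_all_control = [0] * 100

# Classifier B finds 6 of 10 cases and mislabels 18 of 90 controls.
pred_mixed = [1] * 6 + [0] * 4 + [1] * 18 + [0] * 72

print("A: accuracy", accuracy(y_true, pred_all_control),
      "HMSS", hmss(y_true, pred_all_control))
print("B: accuracy", accuracy(y_true, pred_mixed),
      "HMSS", hmss(y_true, pred_mixed))
```

In this toy case, classifier A has the higher accuracy even though it never identifies a case, while its HMSS is zero; classifier B has lower accuracy but a clearly higher HMSS. No resampling of the dataset is needed for the criterion to penalize the trivial classifier.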
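
The single-SNP chi-square test and the Bonferroni correction mentioned in the abstract can be sketched as follows. This is an illustrative Python example under stated assumptions, not the dissertation's procedure: the genotype counts are invented, the function names are hypothetical, and the figure of 500,000 tests at alpha = 0.05 is chosen only to show the arithmetic (the abstract does not say how its own cut-offs were derived). The p-value uses the fact that a chi-square variable with 2 degrees of freedom has survival function exp(-x/2), which applies to a 2 x 3 table of status by genotype.

```python
import math


def chi_square_2x3(case_counts, control_counts):
    """Pearson chi-square statistic and p-value for a 2 x 3 genotype table.

    Each argument holds counts for the three genotypes (e.g. AA, Aa, aa).
    Assumes every genotype column has a nonzero total.
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (table[i][j] - expected) ** 2 / expected

    # (2 - 1) * (3 - 1) = 2 degrees of freedom, so P(X > x) = exp(-x / 2).
    p_value = math.exp(-statistic / 2)
    return statistic, p_value


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cut-off under Bonferroni correction."""
    return alpha / n_tests


# Hypothetical genotype counts for one SNP.
stat, p = chi_square_2x3(case_counts=(30, 50, 20), control_counts=(50, 40, 10))

# Hypothetical genome-wide setting: 500,000 SNPs tested at alpha = 0.05.
cutoff = bonferroni_threshold(0.05, 500_000)

print(f"chi-square = {stat:.4f}, p = {p:.3e}, cut-off = {cutoff:.2e}")
print("significant after correction:", p < cutoff)
```

A SNP that looks nominally significant on its own, as in this toy table, can fall far short of a genome-wide cut-off on the order of 1E-07, which is why few or no SNPs may pass such a threshold.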

Record details

  • Author

    Fang, Shenying

  • Author affiliation

    The University of Texas School of Public Health

  • Degree-granting institution The University of Texas School of Public Health
  • Subject Biology Biostatistics; Biology Bioinformatics
  • Degree Ph.D.
  • Year 2008
  • Pages 362 p.
  • Original format PDF
  • Language of text eng
