
An information theoretic framework for identification and modeling of gene-gene and gene-environment interactions.



Abstract

In this dissertation, we develop, extend, validate, and apply information-theoretic metrics for identifying and characterizing interactions among genetic variations in epidemiological studies, motivated by studies that have linked complex associations among genetic variations to the risk of developing many diseases. We investigate interactions between genes (gene-gene interactions, or GGI) and between genes and non-genetic factors or environmental variables (gene-environment interactions, or GEI). We systematically investigate how our metrics depend on genetic and study-design factors in order to identify GGI and GEI and to enable a visual presentation of the results. Because the underlying structure and true relationships between genetic and environmental factors in experimental data sets are rarely known with certainty, we also develop several simulation strategies that are used extensively for performance evaluation.

The high dimensionality of large data sets (e.g., from genome-wide studies), together with confounding factors such as multiple correlations (or linkage disequilibrium among genes) and genetic heterogeneity, results in a combinatorial explosion in the number of possible interactions present in the data. This combinatorial growth makes it computationally difficult, if not impossible, to exhaustively assess the full range of predictor variables for potential interactions associated with the trait or phenotype variables and diseases in epidemiological studies. Therefore, we develop and evaluate a set of algorithms that efficiently search the combinatorial space to mine significant and non-redundant interactions for both discrete and quantitative phenotypes, and we conduct detailed power, false-discovery-rate, and sample-size analyses for epidemiological studies.

In GEI analysis, a high degree of linkage disequilibrium among the genetic variables causes several interactions to contain redundant information about the phenotype variable. It is therefore essential to prune a set of GEI using a modeling step, which we define as the process of identifying a parsimonious set of combinations or variables that explains the disease phenotype or trait variable while avoiding over- and under-fitted models. We develop a novel algorithm that uses information-theoretic metrics and their properties to perform this model synthesis task efficiently.

Another principal challenge in GEI analysis is developing metrics that prioritize genetic variables for sequencing studies and that incorporate knowledge from interactions between the genes. The gene-environment associations identified in large-scale genotyping studies require large follow-on studies that comprehensively sequence the disease-associated regions to enable discovery of less common genetic variations that may contribute to disease. Such follow-up studies are resource intensive and require large sample sizes, so it is essential to leverage the information available from existing genotyping studies to identify the most promising disease-associated regions and the possible environmental factors. Prioritizing genetic regions involved in GGI or GEI for sequencing studies can be difficult because the number of interactions, their order, and their magnitudes can vary considerably, which makes it hard to judge the relative importance of, e.g., a few large-magnitude interactions vis-à-vis numerous interactions of moderate magnitude. In this research, we develop a novel metric for effectively visualizing and ranking the genetic and environmental variables involved in numerous statistical interactions.

Finally, the phenotype or trait variable is often absent from genetic data sets, and it is then useful to mine statistical interactions among the genetic variables in an unsupervised fashion that can highlight the underlying biological interactions among the genes and proteins present in pathways. To address such analyses, we study the problem of mining statistically significant correlation patterns and interaction information in genetic data. We introduce novel concepts of combinations of variables containing highly significant, moderately significant, and non-significant correlation information, present bounds on correlation information, and develop several pruning strategies that use these bounds to prune the combinatorial search space efficiently. Using the bounds and pruning strategies, we develop efficient search algorithms to mine such associations, and we critically examine the performance of the proposed mining algorithms. (Abstract shortened by UMI.)
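The abstract does not spell out the exact metrics used in the dissertation. As a rough illustration of the general idea only, the sketch below computes a standard information-theoretic synergy score for two genetic variables and a phenotype: the information the pair carries about the phenotype beyond what each variable carries alone. The function names (`entropy`, `mutual_information`, `synergy`) and the toy data are assumptions made for this example, and entropies are simple plug-in estimates from observed frequencies.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Plug-in joint entropy (in bits) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log2(c / n) for c in Counter(joint).values())


def mutual_information(xs, y):
    """I(X_1..X_k; Y) = H(X_1..X_k) + H(Y) - H(X_1..X_k, Y)."""
    return entropy(*xs) + entropy(y) - entropy(*xs, y)


def synergy(x1, x2, y):
    """Information about y carried jointly by (x1, x2) beyond each alone.

    Illustrative score, not necessarily the dissertation's metric:
    I(x1, x2; y) - I(x1; y) - I(x2; y).
    """
    return (mutual_information([x1, x2], y)
            - mutual_information([x1], y)
            - mutual_information([x2], y))


# Hypothetical toy data: the phenotype is the XOR of two binary variants,
# so neither variant is informative alone but the pair is fully informative.
snp_a = [0, 0, 1, 1] * 25
snp_b = [0, 1, 0, 1] * 25
phenotype = [a ^ b for a, b in zip(snp_a, snp_b)]

print(mutual_information([snp_a], phenotype))
print(mutual_information([snp_b], phenotype))
print(synergy(snp_a, snp_b, phenotype))
```

In this toy case each variant has zero marginal information about the phenotype while the pair determines it completely, which is the kind of pure interaction that single-variable association tests miss. Scoring every pair (or higher-order combination) this way is what leads to the combinatorial growth the abstract describes.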

Bibliographic record

  • Author

    Chanda, Pritam

  • Author affiliation

    State University of New York at Buffalo

  • Degree-granting institution: State University of New York at Buffalo
  • Subject: Biology, Bioinformatics; Computer Science
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 247 p.
  • Total pages: 247
  • Original format: PDF
  • Language of text: English (eng)
