
Computational Approaches to Prediction and Analysis of Human Leukocyte Antigen Genes.


Abstract

The Human Leukocyte Antigen (HLA) gene system is the most polymorphic region of the human genome, containing some of the strongest associations with autoimmune, infectious, and inflammatory diseases. It plays a crucial role in hematopoietic stem cell transplantation, where patients and donors are matched with respect to their HLA genes to maximize the chances of a successful transplant. As such, HLA data is a highly valuable asset for clinicians and researchers for elucidating various disease-driving biological mechanisms. This thesis contains original research on the analysis of uncertainty in HLA data, exploration of the strong correlation structure in the region, and prediction of HLA genes from widely available genetic markers.

We start by describing a novel method for correlated multi-label, multi-class prediction, which aims to solve the problem of predicting HLA genes from widely available Single Nucleotide Polymorphism (SNP) data. Direct typing of HLA genes for large studies is expensive due to their extreme genetic polymorphism. Therefore, obtaining the HLA genes by prediction, rather than genetic typing, would be highly time- and cost-effective. In this study we use a two-step approach, involving label (gene) independent classifiers and label dependencies in the form of HLA haplotype frequencies, to predict HLA genes from SNP data. In addition, we propose different ways of integrating label dependency information into the prediction process and evaluate their impact on the prediction performance. The results from experiments on real-world data sets show that adding label dependencies into the prediction of HLA genes increases prediction accuracy when compared against the gene-independent approach.

Next, we aim to resolve and quantify the uncertainty that exists in HLA data sets. Due to the high genetic polymorphism of HLA genes, their molecular typing often results in a set of uncertain or ambiguous assignments, rather than an exact allele assignment at each gene. We propose a novel, information-theoretic measure to quantify uncertainty in HLA typing. In addition, we demonstrate that using the HLA gene dependencies that reflect the strong correlation structure in the region decreases the uncertainty in HLA data.

In the fourth chapter of the thesis, we propose a novel approach for multi-label prediction from uncertain data in the context of SNP-based prediction of HLA genes using ambiguous HLA data in training. Most existing HLA data sets contain uncertainty and, as such, need to be imputed to exact data before being used for training prediction models. Existing approaches for prediction of HLA genes from SNP data do not accommodate learning from uncertain data and, as such, miss the potential for an increased sample size and consequently improvements in prediction performance. In this thesis, we propose a novel algorithm for SNP-based prediction of HLA genes that utilizes ambiguous HLA data for building the prediction model. Additionally, we measure the impact that the uncertainty in the training data has on the prediction accuracy, and evaluate it on a real-world data set. Our results show that prediction from ambiguous HLA data generally performs better than the alternative approach, which first imputes the ambiguous data into high-resolution HLA alleles and then uses them to build the model.

The work in this thesis is a step toward understanding the immense challenges in the analysis of the HLA gene system. In this thesis, we: i) define and solve a problem of prediction of HLA genes from widely available genetic markers using a correlated multi-label, multi-class approach, ii) define and validate a measure to quantify the uncertainty present in HLA data sets, and iii) propose a novel approach to correlated prediction from uncertain data in the context of prediction of HLA genes. We conclude the thesis by discussing future work to further the understanding of this important genetic region through novel computational algorithms.
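The abstract does not give the thesis's formulas, so the following is only an illustrative sketch of the two general ideas it names: combining gene-independent predictions with haplotype frequencies, and measuring uncertainty with an information-theoretic quantity. It assumes Shannon entropy as the uncertainty measure, a simple product of per-gene probabilities reweighted by haplotype frequency, and a single haplotype rather than a diploid genotype. The function names, allele probabilities, and frequencies are all hypothetical and are not taken from the thesis.

```python
import math
from itertools import product


def entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def joint_posterior(per_gene_probs, haplotype_freqs):
    """Reweight independent per-gene predictions by haplotype frequencies.

    per_gene_probs: list of dicts, one per gene, mapping allele -> probability
                    (stand-ins for the outputs of gene-independent classifiers).
    haplotype_freqs: dict mapping a tuple of alleles (one per gene) to a
                     population frequency; unseen combinations count as 0.
    Returns a normalized distribution over allele combinations.
    """
    scores = {}
    for combo in product(*(gene.keys() for gene in per_gene_probs)):
        # Step 1: score the combination as if the genes were independent.
        independent = math.prod(g[a] for g, a in zip(per_gene_probs, combo))
        # Step 2: weight by how common this combination is as a haplotype.
        scores[combo] = independent * haplotype_freqs.get(combo, 0.0)
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no allele combination has nonzero support")
    return {combo: s / total for combo, s in scores.items()}


# Hypothetical example: two genes, each ambiguous between two alleles.
per_gene = [
    {"A*01:01": 0.5, "A*02:01": 0.5},
    {"B*08:01": 0.5, "B*07:02": 0.5},
]
# Made-up haplotype frequencies, for illustration only.
freqs = {("A*01:01", "B*08:01"): 0.08, ("A*02:01", "B*07:02"): 0.02}

# Treating the genes independently gives four equally likely combinations.
independent_entropy = entropy_bits(
    [pa * pb for pa in per_gene[0].values() for pb in per_gene[1].values()]
)
posterior = joint_posterior(per_gene, freqs)
dependent_entropy = entropy_bits(posterior.values())

print(f"entropy without dependencies: {independent_entropy:.3f} bits")
print(f"entropy with haplotype frequencies: {dependent_entropy:.3f} bits")
```

In this toy case the entropy drops from 2 bits to about 0.72 bits once the haplotype frequencies are applied, which mirrors the abstract's qualitative claim that gene dependencies reduce uncertainty. It says nothing about the magnitude of the effect on real data.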

Bibliographic record

  • Author

    Paunic, Vanja

  • Author affiliation

    University of Minnesota

  • Degree-granting institution University of Minnesota
  • Subject Computer Science; Biology, Molecular; Health Sciences, Immunology
  • Degree Ph.D.
  • Year 2014
  • Pages 116 p.
  • Total pages 116
  • Original format PDF
  • Language of text English
