
Algorithmic data fusion methods for tuberculosis.



Abstract

The exponential growth of genomic data that followed the advent of gene sequencing technologies has shifted the emphasis to analyzing many datasets from as many sources as possible. Data from multiple sources, in the form of matrices and tensors, can be analyzed separately, or they can be coupled and decomposed simultaneously. This data deluge is also observed in patient datasets of tuberculosis (TB), an infectious disease caused by Mycobacterium tuberculosis complex (MTBC). Epidemiologists, clinicians, and health care practitioners aim to find transmission routes, detect or rule out possible outbreaks, and control TB. For this purpose, patient isolates are routinely genotyped by multiple biomarkers, including spacer oligonucleotide types (spoligotypes) and Mycobacterial Interspersed Repetitive Units - Variable Number Tandem Repeats (MIRU-VNTR). What remains is to draw inferences from this abundance of data. In this thesis, we propose algorithmic data fusion methods for tuberculosis using multiple sources of information from MTBC strains and TB patients.

In the first study, we propose the Tensor Clustering Framework (TCF) on multiple-biomarker tensors (MBT) and subdivide major lineages of MTBC into sublineages via genomic data fusion. The MBT holds data from two biomarkers, spoligotypes and MIRU patterns. We factorize the MBT into its component matrices using multiway models. Based on the component matrix of the strain mode, we cluster MTBC strains into sublineages. Our new definition of sublineages based on two biomarkers confirms some of the existing sublineages and suggests subdividing or merging others.

In the second study, we propose a new mutation model of spoligotypes based on both the spoligotypes themselves and MIRU patterns. The model uses a maximum parsimony method based on three genetic distance measures on these two biomarkers. The resulting putative mutation history of spoligotypes, depicted as a spoligoforest, shows notable topological attributes. The number of descendant spoligotypes follows a power-law distribution. In addition, the number of mutations at each spacer in the direct repeat (DR) region follows a spatially bimodal distribution. Based on this observation, we built two alternative models for mutation length frequency: the Starting Point Model (SPM) and the Longest Block Model (LBM). Both models plausibly fit the mutation length frequency distribution in the spoligoforest.

In the third study, we propose the Unified Biclustering Framework (UBF) for host-pathogen association analysis of tuberculosis patients via genome-phenome data fusion. UBF is flexible in the sense that we can incorporate genetic distance between MTBC strains, spatial distance between TB patients, and time into domain knowledge, and factorize these joint datasets via coupled matrix-matrix and matrix-tensor factorization. We calculate a feature pattern similarity matrix of (spoligotype, country) pairs and use it as input to our novel density-invariant biclustering algorithm. Finally, we select statistically significant biclusters using the average best-match score. The resulting biclusters verify some of the well-known host-pathogen associations between MTBC strains and the geographic distribution of their hosts, and suggest new patient-strain relationships.
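The first study's pipeline (factorize a three-way tensor, then cluster on the strain-mode component matrix) can be illustrated with a minimal sketch. The abstract does not say which multiway model or clustering algorithm the thesis uses, so everything here is an assumption for illustration: a CP (PARAFAC) decomposition fitted by alternating least squares, a small k-means with farthest-point initialization, and a random toy tensor whose modes are hypothetically strains x spacers x MIRU loci.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J x R), (K x R) -> (J*K x R)."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def unfold(X, mode):
    """Mode-n unfolding; the remaining modes are flattened in C order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, rank, n_iter=50, seed=0):
    """Fit a rank-`rank` CP model to a 3-way tensor by alternating least squares.

    Returns the three factor matrices and the relative reconstruction
    error recorded after each full sweep.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in X.shape]
    errors = []
    for _ in range(n_iter):
        for mode in range(3):
            # Hold the other two factors fixed and solve a linear
            # least-squares problem for the current mode.
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            factors[mode] = unfold(X, mode) @ np.linalg.pinv(kr).T
        recon = np.einsum("ir,jr,kr->ijk", *factors)
        errors.append(np.linalg.norm(X - recon) / np.linalg.norm(X))
    return factors, errors

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster empties
                centers[c] = points[labels == c].mean(axis=0)
    return labels

# Toy stand-in for a multiple-biomarker tensor (random data, not real genotypes):
# 12 hypothetical strains x 5 spacers x 4 MIRU loci.
rng = np.random.default_rng(1)
X = rng.random((12, 5, 4))

factors, errors = cp_als(X, rank=2)
strain_factors = factors[0]          # component matrix of the strain mode
labels = kmeans(strain_factors, k=2)  # candidate "sublineage" assignments
```

Each row of `strain_factors` is a low-dimensional description of one strain that combines both biomarkers, which is why clustering is done on that matrix rather than on either biomarker alone. The rank and the number of clusters are fixed by hand here; in practice both would need to be chosen by some model-selection or cluster-validation criterion.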

Bibliographic record

  • Author

    Ozcaglar, Cagri

  • Author affiliation

    Rensselaer Polytechnic Institute

  • Degree-granting institution: Rensselaer Polytechnic Institute
  • Subjects: Biology, Genetics; Engineering, Computer; Biology, Bioinformatics; Computer Science
  • Degree: Ph.D.
  • Year: 2012
  • Pages: 164
  • Original format: PDF
  • Language: English
