
Mining emerging massive scientific sequence data using block-wise decomposition methods.


Abstract

I present efficient data mining algorithms for knowledge discovery on two types of emerging large-scale, sequence-based scientific datasets: (1) static sequence data generated from SNP diversity arrays for genomic studies, and (2) dynamic sequence data collected in streaming and sensor network systems for environmental studies. The massive, noisy nature of SNP arrays and the distributed, online nature of sensor network data pose challenges for knowledge discovery, including scalability, robustness, and efficiency. Despite their different characteristics, both SNP arrays and streaming sensor data can be viewed as sequences of ordered observations, and both can be mined efficiently using algorithms based on block-wise decomposition methods.

I present models and mining algorithms for inferring the genetic variation structure in genome-wide Single-Nucleotide Polymorphism (SNP) arrays. Genome-wide SNP arrays provide a comprehensive view of genome variation and serve as powerful resources for genetic and biomedical studies. Understanding the patterns of genetic variation in a population of individuals plays an important role in solving many genetics problems, such as genealogy reconstruction and gene association studies. In this thesis, I propose data mining models and algorithms to efficiently infer genetic variation structure from massive SNP panels of recombinant sequences resulting from meiotic recombination. I introduce the Minimum Segmentation Problem (MSP) to infer the segmentation structure of a single recombinant strain, and the Minimum Mosaic Problem (MMP) to infer the mosaic structure of a panel of recombinant strains. Both MSP and MMP estimate the ancestral polymorphism patterns exhibited in recombinant strains, which provide important inputs for subsequent association analysis. Efficient dynamic programming and graph algorithms based on block-wise decomposition are proposed that can solve MSP and MMP on genome-wide, large-scale panels.

I present efficient algorithms for mining massive streaming and sensor network data for observational sciences such as ecology and environmental studies. I propose efficient algorithms that use block-wise synopsis construction to capture the data distribution online for the dynamic sequence data collected in sensor network and streaming systems. These algorithms support clustering analysis and order-statistics computation, which are critical for real-time monitoring, anomaly detection, and other domain-specific analysis.
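The abstract does not define the MSP formally, so the following is only an illustrative sketch under stated assumptions: the recombinant strain and a set of founder (ancestral) strains are given as equal-length strings of SNP alleles, and the goal is to cover the recombinant with the fewest segments, each identical to one founder over the same positions. The function name `min_segmentation` and the input encoding are hypothetical. The sketch uses a simple greedy scan (always extend the current segment as far as any founder allows), which is not the block-wise dynamic programming method of the thesis.

```python
from typing import List, Tuple


def min_segmentation(recombinant: str,
                     founders: List[str]) -> List[Tuple[int, int, int]]:
    """Cover `recombinant` with the fewest founder-matching segments.

    Assumption (not from the thesis): sequences are equal-length strings,
    one character per SNP. Returns (start, end_exclusive, founder_index).
    """
    n = len(recombinant)
    if any(len(f) != n for f in founders):
        raise ValueError("all sequences must have the same length")

    segments: List[Tuple[int, int, int]] = []
    start = 0
    while start < n:
        best_end, best_founder = start, -1
        for idx, founder in enumerate(founders):
            # Extend the match against this founder as far as possible.
            end = start
            while end < n and founder[end] == recombinant[end]:
                end += 1
            if end > best_end:
                best_end, best_founder = end, idx
        if best_founder < 0:
            # No founder carries the allele observed at this position.
            raise ValueError(f"no founder matches position {start}")
        segments.append((start, best_end, best_founder))
        start = best_end
    return segments


if __name__ == "__main__":
    founders = ["00000000", "11111111", "00110011"]
    print(min_segmentation("00011111", founders))
```

With the hypothetical founders above, the recombinant `"00011111"` splits into two segments: positions 0 to 2 copied from founder 0 and positions 3 to 7 copied from founder 1. The greedy choice is safe here because any prefix of a matching segment is also a match, so extending as far as possible never increases the segment count. Noisy genotypes, which the abstract highlights, would need a tolerance for mismatches that this sketch omits.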

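Bibliographic record

  • Author: Zhang, Qi
  • Author affiliation: The University of North Carolina at Chapel Hill
  • Degree-granting institution: The University of North Carolina at Chapel Hill
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2009
  • Pages: 143 p.
  • Total pages: 143
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology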
