
Meta-analysis framework for peak calling by combining multiple ChIP-seq algorithms and gene clustering by combining multiple transcriptomic studies


Abstract

With a large number of genomics studies now available, integrating information from multiple sources improves knowledge discovery. To address the complexity of the genome and its numerous genetic features, meta-analysis aggregates information to achieve higher statistical power for the measure of interest, and to identify patterns among study results and sources of disagreement among them.

As next-generation sequencing (NGS) technologies become affordable and can provide per-base resolution, NGS data serve as an appealing tool for analyzing genomic features. Among the various applications of NGS, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is primarily used to provide quantitative, genome-wide mapping of target protein and DNA interaction events. Signal peak-calling algorithms identify target regions of interest enriched in vitro. Despite the existing programs for the earlier ChIP-chip platforms, peak calling of putative protein binding sites from large, sequencing-based data sets presents a bioinformatic challenge that has required considerable computational innovation. Popular peak-calling algorithms, such as MACS, SPP, CisGenome, SISSRs, USeq, and PeakSeq, are widely applied, but each places a different emphasis on sensitivity or specificity, or selects peaks of different sizes and shapes. In the first project of this dissertation, we propose a meta-analysis framework, ChIP-MetaCaller, that combines multiple top-performing algorithms to identify and reprioritize peaks. We provide a forward selection algorithm to decide the best combination of algorithm outputs for the meta-analysis, and show that the result improves motif enrichment and sensitivity. The results are more tractable for biologists to use in further validation and hypothesis generation.

The mechanisms of complex diseases such as cancers involve changes in multiple genes, each conferring a small, incremental risk, that potentially converge in deregulated biological pathways, cellular functions, and local circuit changes. Understanding this complex network requires the discovery of co-expressed gene modules. Pilot studies in the literature show that meta-analysis can improve the performance of machine learning techniques in identifying these modules. In the second project of this dissertation, we propose an approach based on the clustering results of each individual study. Combining standardized distances from genes to the medoids yields an integrated distance matrix, on which meta-clustering is performed. We compare the performance of the proposed approach and of Meta Clustering combining distance under three simulation settings and on three real data sets, and provide guidance for practitioners.

The two projects in this dissertation tackle different biological questions based on genomics data. Both improve on the performance of existing methods by integrating information within meta-analysis frameworks, and both provide comprehensive biomarker detection. This work could improve public health by providing more effective methodologies for biomarker detection when integrating multiple genomic studies.
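
The second project's idea of combining standardized gene-to-medoid distances into one integrated distance matrix can be illustrated with a minimal sketch. This is not the dissertation's implementation: the abstract does not specify the distance, the standardization, or the combination rule, so the choices below (Euclidean distance, z-scoring within each study, concatenating the per-study profiles) and the function names `medoid_distance_profile` and `integrated_distance` are assumptions made only for illustration. The medoids are also assumed to be already known from each study's own clustering.

```python
import numpy as np

def medoid_distance_profile(expr, medoid_idx):
    """Distance from every gene (row of expr) to each medoid gene.

    Assumption: Euclidean distance, z-scored over the whole study so
    that studies on different scales become comparable.
    """
    medoids = expr[medoid_idx]                                   # (K, samples)
    d = np.linalg.norm(expr[:, None, :] - medoids[None, :, :], axis=2)  # (G, K)
    return (d - d.mean()) / d.std()

def integrated_distance(studies, medoids_per_study):
    """Gene-by-gene distance matrix pooled over studies.

    Assumption: per-study profiles are concatenated column-wise and
    genes are compared by Euclidean distance between pooled profiles.
    """
    profiles = [medoid_distance_profile(x, m)
                for x, m in zip(studies, medoids_per_study)]
    pooled = np.hstack(profiles)                                 # (G, total medoids)
    diff = pooled[:, None, :] - pooled[None, :, :]
    return np.linalg.norm(diff, axis=2)                          # (G, G)

# Toy data: 6 genes shared by two studies; genes 0-2 form one module
# (mean 0) and genes 3-5 another (mean 5). Samples differ per study.
rng = np.random.default_rng(0)
module_means = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0])[:, None]
study_a = module_means + rng.normal(scale=0.1, size=(6, 4))
study_b = module_means + rng.normal(scale=0.1, size=(6, 5))

# Hypothetical medoids: gene 0 and gene 3 in both studies.
D = integrated_distance([study_a, study_b], [[0, 3], [0, 3]])
print(np.round(D, 2))
```

In this toy setting, genes from the same module end up close together in `D` and genes from different modules end up far apart, so any distance-based clustering method could then be run on `D` as the meta-clustering step.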

Bibliographic record

  • Author

    Chen Rui

  • Year 2015
  • Original format PDF
  • Language en
