Foreign-language journal: Quality Control, Transactions

dpGMM: A Dirichlet Process Gaussian Mixture Model for Copy Number Variation Detection in Low-Coverage Whole-Genome Sequencing Data

Abstract

Comprehensive identification and cataloging of copy number variations (CNVs) is essential to providing a complete view of human genetic variation and to finding disease genes. Because large-scale whole-genome sequencing (WGS) must keep costs under control, low-coverage data is often favored for CNV identification. However, such low-coverage data is sensitive to noise and sequencing biases, which has resulted in low-resolution CNV detection in past experimental designs for WGS datasets. In this paper, we present a control-free approach based on a Dirichlet process Gaussian mixture model (dpGMM) that analyzes the read depth (RD) of low-coverage WGS datasets for CNV discovery. First, noise and biases in the RD signals are corrected in the preprocessing step of dpGMM. We then assume that RD signals across genomic regions follow a Gaussian mixture model (GMM) in which each Gaussian component corresponds to a copy number state. Because the number of Gaussian components does not have to be specified in advance, dpGMM builds a Dirichlet process (DP) GMM for the RD signals and uses the DP prior to infer the number of components. We apply dpGMM to simulated datasets with different coverages and to real datasets from individuals, and compare it with three widely used RD-based pipelines: CNVnator, GROM-RD, and BIC-seq2. Simulation results demonstrate that dpGMM achieves a high F1 score on both low- and high-coverage sequences. In addition, on real data the CNVs detected by dpGMM yield twice as many overlaps with the standard benchmark as those detected by other tools such as CNVnator and GROM-RD.
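
The idea of letting the data decide how many copy number states there are can be illustrated with a small sketch. It is not the authors' inference procedure: instead of a full DP Gaussian mixture it uses DP-means, the hard-assignment, small-variance limit of such a mixture, because it fits in a few lines of standard-library Python. The function name `dp_means_1d`, the `penalty` value, the synthetic bins, and the assumption that RD has been normalized so that the diploid baseline is 1.0 are all hypothetical choices made for illustration.

```python
def dp_means_1d(values, penalty, max_iter=100):
    """DP-means on 1-D data: like k-means, but a point whose squared
    distance to every center exceeds `penalty` opens a new cluster."""
    centers = [sum(values) / len(values)]  # start with one cluster at the global mean
    labels = [0] * len(values)
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(values):
            dists = [(x - c) ** 2 for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > penalty:
                centers.append(x)          # open a new cluster at this point
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # Drop empty clusters, renumber labels, and recompute the means.
        used = sorted(set(labels))
        remap = {old: new for new, old in enumerate(used)}
        labels = [remap[lab] for lab in labels]
        centers = [
            sum(x for x, lab in zip(values, labels) if lab == j) / labels.count(j)
            for j in range(len(used))
        ]
        if not changed:
            break
    return centers, labels


# Synthetic normalized read depth per bin (assumption: diploid baseline = 1.0).
# A fixed jitter pattern replaces random noise so the run is reproducible.
jitter = [-0.04, -0.02, 0.0, 0.02, 0.04]
neutral = [1.0 + jitter[i % 5] for i in range(300)]  # copy number 2
loss = [0.5 + jitter[i % 5] for i in range(40)]      # copy number 1
gain = [1.5 + jitter[i % 5] for i in range(40)]      # copy number 3
values = neutral + loss + gain

centers, labels = dp_means_1d(values, penalty=0.04)
# Under the baseline assumption, a cluster mean m maps to copy number round(2 * m).
copy_states = sorted(round(2 * c) for c in centers)
print(len(centers), copy_states)
```

The number of clusters is never passed in; it emerges from the penalty, which plays a role loosely analogous to the DP concentration parameter. A full DP-GMM would additionally estimate per-component variances and soft assignments, which this sketch omits.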