Source: Nucleic Acids Research (U.S. National Institutes of Health literature)

EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data

Abstract

The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV (a) identifies and eliminates batch effects at the raw data level; (b) uses a heuristic algorithm to assemble individual CNV calls from multiple existing callers with complementary strengths into CNV regions (CNVRs); (c) re-genotypes each CNVR with a local likelihood model adjusted by global information across multiple CNVRs; (d) refines CNVR boundaries using the local correlation structure in copy number intensities; (e) provides direct CNV genotyping accompanied by a confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while also achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to those of SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.
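
Step (b) of the abstract, assembling per-sample CNV calls from several callers into shared CNV regions, can be illustrated with a small sketch. The abstract does not specify the heuristic, so the code below is not ensembleCNV's algorithm; it assumes a simple greedy rule that merges a call into the current region when their reciprocal overlap reaches a threshold. The names `CnvCall`, `Cnvr`, `reciprocal_overlap`, `assemble_cnvrs` and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CnvCall:
    """One CNV call from one caller for one sample (hypothetical record layout)."""
    sample: str
    caller: str
    chrom: str
    start: int  # inclusive coordinate
    end: int    # inclusive coordinate


@dataclass
class Cnvr:
    """A CNV region: the union of the calls assigned to it."""
    chrom: str
    start: int
    end: int
    calls: List[CnvCall]


def reciprocal_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Return the overlap length divided by the longer interval's length."""
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start + 1), overlap / (b_end - b_start + 1))


def assemble_cnvrs(calls: List[CnvCall], min_overlap: float = 0.5) -> List[Cnvr]:
    """Greedily group sorted calls into regions (assumed rule, not the published one)."""
    regions: List[Cnvr] = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        last = regions[-1] if regions else None
        if (
            last is not None
            and last.chrom == call.chrom
            and reciprocal_overlap(last.start, last.end, call.start, call.end) >= min_overlap
        ):
            # Extend the current region to cover the newly merged call.
            last.calls.append(call)
            last.start = min(last.start, call.start)
            last.end = max(last.end, call.end)
        else:
            regions.append(Cnvr(call.chrom, call.start, call.end, [call]))
    return regions
```

Given two strongly overlapping calls on one chromosome from different callers plus a distant third call, this rule would yield two regions, the first spanning the union of the overlapping pair. A real pipeline would also need to handle copy-number state and caller-specific boundary noise, which this sketch ignores.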
