EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data

Abstract

The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV (a) identifies and eliminates batch effects at raw data level; (b) assembles individual CNV calls into CNV regions (CNVRs) from multiple existing callers with complementary strengths by a heuristic algorithm; (c) re-genotypes each CNVR with local likelihood model adjusted by global information across multiple CNVRs; (d) refines CNVR boundaries by local correlation structure in copy number intensities; (e) provides direct CNV genotyping accompanied with confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.
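Step (b) above, assembling individual calls from several callers into CNV regions, can be illustrated with a minimal sketch. The abstract says only that a heuristic algorithm is used, so everything below is an assumption for illustration rather than the authors' method: the greedy single-pass clustering, the 50% reciprocal-overlap threshold, the half-open coordinates, and the names `CnvCall`, `Cnvr`, `assemble_cnvrs`, `callerA`, `callerB` and `callerC` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class CnvCall:
    """One CNV call from one caller for one sample (hypothetical record layout)."""
    sample: str
    caller: str
    chrom: str
    start: int  # half-open interval [start, end)
    end: int


@dataclass
class Cnvr:
    """A CNV region: the union of the calls clustered into it."""
    chrom: str
    start: int
    end: int
    calls: List[CnvCall] = field(default_factory=list)


def reciprocal_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if disjoint."""
    inter = min(a_end, b_end) - max(a_start, b_start)
    if inter <= 0:
        return 0.0
    return min(inter / (a_end - a_start), inter / (b_end - b_start))


def assemble_cnvrs(calls: List[CnvCall], min_overlap: float = 0.5) -> List[Cnvr]:
    """Greedily cluster calls into regions (assumed heuristic, not the paper's)."""
    regions: List[Cnvr] = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        # Find the first existing region on the same chromosome that this
        # call overlaps reciprocally by at least min_overlap.
        target = None
        for region in regions:
            if region.chrom == call.chrom and reciprocal_overlap(
                region.start, region.end, call.start, call.end
            ) >= min_overlap:
                target = region
                break
        if target is None:
            regions.append(Cnvr(call.chrom, call.start, call.end, [call]))
        else:
            # Widen the region to cover the new call.
            target.start = min(target.start, call.start)
            target.end = max(target.end, call.end)
            target.calls.append(call)
    return regions


# Made-up example calls; caller names are placeholders.
example_calls = [
    CnvCall("s1", "callerA", "chr1", 100, 200),
    CnvCall("s2", "callerB", "chr1", 110, 210),
    CnvCall("s3", "callerA", "chr1", 500, 600),
    CnvCall("s1", "callerC", "chr2", 100, 200),
]
regions = assemble_cnvrs(example_calls)
for r in regions:
    print(r.chrom, r.start, r.end, len(r.calls))
```

In this toy input the first two chr1 calls overlap each other by 90% and fall into one region, while the distant chr1 call and the chr2 call each form their own. Re-genotyping and boundary refinement, steps (c) and (d), are not covered by this sketch.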

Bibliographic record

  • Source
    Nucleic Acids Research | 2019, Issue 7 | 13 pages
  • Author affiliations

    Icahn Sch Med Mt Sinai Dept Genet & Genom Sci New York NY 10029 USA;

    Icahn Sch Med Mt Sinai Dept Genet & Genom Sci New York NY 10029 USA;

    Johns Hopkins Univ Bloomberg Sch Publ Hlth Ctr Early Life Origins Dis Dept Populat Family & Reprod Hlth Baltimore MD 21205 USA;

    Icahn Sch Med Mt Sinai Dept Genet & Genom Sci New York NY 10029 USA;

    Karolinska Univ Jukhuset Integrated Cardio Metab Ctr Karolinska Inst Huddinge Sweden;

    Icahn Sch Med Mt Sinai Dept Genet & Genom Sci New York NY 10029 USA;

    Tartu Univ Hosp Dept Cardiac Surg Tartu Estonia;

    Icahn Sch Med Mt Sinai Cardiovasc Res Ctr New York NY 10029 USA;

    Icahn Sch Med Mt Sinai Dept Genet & Genom Sci New York NY 10029 USA;

    Johns Hopkins Univ Bloomberg Sch Publ Hlth Ctr Early Life Origins Dis Dept Populat Family & Reprod Hlth Baltimore MD 21205 USA;

    Icahn Sch Med Mt Sinai Dept Genet & Genom Sci New York NY 10029 USA;

  • Original format: PDF
  • Text language: English
  • Chinese Library Classification: Biochemistry
