Conference: Euromicro International Conference on Parallel, Distributed and Network-Based Processing

An efficient and scalable SPARK preprocessing methodology for Genome Wide Association Studies

Abstract

The importance of high-performance software frameworks for analyzing omics data obtained from High-Throughput (HT) assays is widely recognized. HT methodologies comprise microarrays, Genome-Wide Association Studies (GWAS), and Next Generation Sequencing (NGS), each of which produces a vast amount of data per experiment. Each HT vendor provides users only with the software frameworks and proprietary libraries for annotating and summarizing raw data. Consequently, the need arises for algorithms that preprocess and analyze omics data. GWAS aims to highlight the association between genetic variants and diseases by examining single nucleotide polymorphisms (SNPs) that differ in a statistically significant way between cases and controls. The effectiveness of GWAS analysis increases with the number of samples analyzed per experiment. GWAS data analyzed with statistical methods can reveal associations only between a single allelic variant and the clinical condition of the samples. To overcome this limitation and discover associations involving multiple allelic variants, Association Rule Mining (ARM) can be used. This in turn requires scalable ARM algorithms able to analyze GWAS data, and hence a high-performance data analytics framework. For this purpose, we propose a software framework called GARMS (GWAS Association Rule Mining in Spark), built on top of Apache Spark, for preprocessing GWAS data sets and mining association rules from them. GARMS comprises a two-step analysis methodology: (i) in the first step, the GWAS data are preprocessed and the frequent itemsets are identified; (ii) in the second step, the frequent itemsets are used to mine association rules without scanning the input data again. We implemented our algorithm and tested it on synthetic GWAS data sets. Preliminary results indicate that our method can extract relevant association rules from GWAS data while reducing computational time.
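
The two-step idea described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the GARMS implementation, which the abstract does not detail, and it does not use Spark. It assumes each sample is encoded as a set of hypothetical items such as "rs1=AA" (a genotype at an SNP) and "status=case"; the function names, the toy data, and the thresholds are all invented for illustration. Step 1 scans the samples to collect frequent itemsets and their supports; step 2 derives rules from those stored supports alone, without reading the samples again.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Step 1: scan the samples level by level and keep itemsets whose
    relative support is at least min_support (Apriori-style pruning)."""
    n = len(transactions)
    supports = {}
    for size in range(1, max_size + 1):
        counts = {}
        for items in transactions:
            for combo in combinations(sorted(items), size):
                # Skip candidates that contain an infrequent subset.
                if size > 1 and any(
                    frozenset(sub) not in supports
                    for sub in combinations(combo, size - 1)
                ):
                    continue
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
        level = {k: c / n for k, c in counts.items() if c / n >= min_support}
        if not level:
            break
        supports.update(level)
    return supports


def association_rules(supports, min_confidence):
    """Step 2: build rules from the stored supports only.
    The original samples are never touched here."""
    rules = []
    for itemset, support in supports.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is guaranteed to be in the table.
                confidence = support / supports[antecedent]
                if confidence >= min_confidence:
                    rules.append(
                        (antecedent, itemset - antecedent, support, confidence)
                    )
    return rules


# Hypothetical toy data: four samples, two SNPs, one clinical status each.
transactions = [
    {"rs1=AA", "rs2=CT", "status=case"},
    {"rs1=AA", "rs2=CT", "status=case"},
    {"rs1=AA", "rs2=CC", "status=control"},
    {"rs1=AG", "rs2=CT", "status=control"},
]

supports = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(supports, min_confidence=0.9)

for antecedent, consequent, support, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), support, confidence)
```

Among the rules this toy run produces is one whose antecedent combines two genotypes (rs1=AA and rs2=CT) and whose consequent is status=case, the kind of multi-variant association that a single-variant statistical test would not report. On Spark, the first step would more plausibly rely on a distributed frequent-pattern miner such as the FP-Growth implementation in Spark MLlib; whether GARMS does so is not stated in the abstract.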