An efficient and scalable SPARK preprocessing methodology for Genome Wide Association Studies

Abstract

The importance of high-performance software frameworks for analyzing omics data obtained from High-Throughput (HT) assays is widely recognized. HT methodologies include microarrays, Genome-Wide Association Studies (GWAS), and Next Generation Sequencing (NGS), each of which produces a vast amount of data per experiment. HT vendors provide users only with the software frameworks and proprietary libraries for annotating and summarizing raw data. Consequently, algorithms for the preprocessing and analysis of omics data are needed. GWAS aims to highlight associations between genetic variants and diseases by examining single nucleotide polymorphisms (SNPs) that differ in a statistically significant way between cases and controls. The effectiveness of a GWAS analysis increases with the number of samples analyzed per experiment. Statistical methods applied to GWAS data can detect associations between a single allelic variant and the clinical condition of the samples. To overcome this single-variant limitation and discover associations involving multiple allelic variants, Association Rule Mining (ARM) can be used. Scalable ARM algorithms able to analyze GWAS data are therefore needed, and they in turn require a high-performance data analytics framework. For this purpose, we propose GARMS (GWAS Association Rule Mining in Spark), a software framework built on top of Apache Spark for preprocessing GWAS data sets and mining association rules from them. GARMS follows a two-step analysis methodology: (i) in the first step, the GWAS data are preprocessed and the frequent itemsets are identified; (ii) in the second step, the frequent itemsets are used to mine association rules without rescanning the input data. We implemented our algorithm and tested it on synthetic GWAS data sets. Preliminary results indicate that our method can extract relevant association rules from GWAS data while reducing computational time.
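
The two-step methodology described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the GARMS implementation, which runs on Apache Spark and is not shown in this record. The toy genotype table, the `field=value` item encoding, the function names, and the thresholds are all assumptions made for illustration, and itemsets are enumerated by brute force rather than by whatever mining algorithm the authors use. The point it shows is that step 2 works only from the supports stored in step 1 and never touches the samples again.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input: one row per sample, a genotype per SNP plus a phenotype label.
SAMPLES = [
    {"rs1": "AA", "rs2": "CT", "status": "case"},
    {"rs1": "AA", "rs2": "CT", "status": "case"},
    {"rs1": "AG", "rs2": "CT", "status": "control"},
    {"rs1": "AA", "rs2": "CC", "status": "control"},
]


def to_transactions(samples):
    """Preprocessing: encode each sample as a set of 'field=value' items."""
    return [frozenset(f"{k}={v}" for k, v in s.items()) for s in samples]


def frequent_itemsets(transactions, min_support, max_size=3):
    """Step 1: count every itemset up to max_size, keep those meeting min_support."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    n = len(transactions)
    return {iset: c / n for iset, c in counts.items() if c / n >= min_support}


def association_rules(supports, min_confidence):
    """Step 2: derive rules from the stored supports only (no rescan of the input)."""
    rules = []
    for itemset, supp in supports.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                confidence = supp / supports[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, supp, confidence))
    return rules


# Usage: mine rules from the toy table with illustrative thresholds.
supports = frequent_itemsets(to_transactions(SAMPLES), min_support=0.5)
for antecedent, consequent, supp, conf in association_rules(supports, min_confidence=1.0):
    print(sorted(antecedent), "->", sorted(consequent), supp, conf)
```

In a distributed setting the counting in step 1 is the part that would be spread across workers, since it is the only stage that reads the samples. Spark's MLlib includes an FP-Growth implementation for this kind of mining, but the abstract does not say whether GARMS relies on it.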

Bibliographic record

Source
Euromicro International Conference on Parallel, Distributed and Network-Based Processing, 2020, pp. 369-375 (7 pages)
Authors
Giuseppe Agapito; Pietro Hiram Guzzi; Mario Cannataro
Original format: PDF
Keywords
Sparks; Task analysis; Data mining; Bioinformatics; Distributed databases; Computational modeling; Google

Similar literature

1. Scalable privacy-preserving data sharing methodology for genome-wide association studies: an application to iDASH healthcare privacy protection challenge [J]. Fei Yu, Zhanglong Ji. BMC Medical Informatics and Decision Making. 2014, Supplement 1
2. Scalable privacy-preserving data sharing methodology for genome-wide association studies: an application to iDASH healthcare privacy protection challenge [J]. Fei Yu, Zhanglong Ji. BMC Medical Informatics and Decision Making. 2014, Supplement 1
3. Scalable privacy-preserving data sharing methodology for genome-wide association studies [J]. Fei Yu, Stephen E. Fienberg, Aleksandra B. Slavkovic. Journal of Biomedical Informatics. 2014
4. An efficient and scalable SPARK preprocessing methodology for Genome Wide Association Studies [C]. Giuseppe Agapito, Pietro Hiram Guzzi, Mario Cannataro. Euromicro International Conference on Parallel, Distributed and Network-Based Processing. 2020
5. Randomized Fixed Model (RFM) Methodology for Genome-Wide Association Study [D]. Sigdel, Sakar. 2018
6. Scalable privacy-preserving data sharing methodology for genome-wide association studies: an application to iDASH healthcare privacy protection challenge [O]. Fei Yu, Zhanglong Ji. 2014
7. Scalable privacy-preserving data sharing methodology for genome-wide association studies: an application to iDASH healthcare privacy protection challenge [O]. Fei Yu, Zhanglong Ji. 2014
