Benchmarking distributed data warehouse solutions for storing genomic variant information

Abstract

Genomic-based personalized medicine encompasses storing, analysing and interpreting genomic variants as its central issues. At a time when thousands of patients' sequenced exomes and genomes are becoming available, there is a growing need for efficient database storage and querying. The answer could be the application of modern distributed storage systems and query engines. However, the application of large genomic variant databases to this problem has not been sufficiently explored so far in the literature. To investigate the effectiveness of modern columnar storage [column-oriented Database Management System (DBMS)] and query engines, we have developed a prototypic genomic variant data warehouse, populated with large generated content of genomic variants and phenotypic data. Next, we have benchmarked the performance of a number of combinations of distributed storages and query engines on a set of SQL queries that address biological questions essential for both research and medical applications. In addition, a non-distributed, analytical database (MonetDB) has been used as a baseline. Comparison of query execution times confirms that distributed data warehousing solutions outperform classic relational DBMSs. Moreover, pre-aggregation and further denormalization of data, which reduce the number of distributed join operations, improve query performance by several orders of magnitude. Most distributed back-ends offer good performance for complex analytical queries, while the Optimized Row Columnar (ORC) format paired with Presto and Parquet with Spark 2 query engines provide, on average, the lowest execution times. Apache Kudu, on the other hand, is the only solution that guarantees sub-second performance for simple genome range queries returning a small subset of data, where low-latency response is expected, while still offering decent performance for running analytical queries.
In summary, research and clinical applications that require the storage and analysis of variants from thousands of samples can benefit from the scalability and performance of distributed data warehouse solutions.
Database URL:
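
The two query shapes named in the abstract, a simple genome range query and an analytical query that is served either through a join or from a pre-aggregated table, can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module on a single node purely to show the shape of the SQL; it is not one of the benchmarked engines and says nothing about distributed performance. The schema, table names, column names and sample rows are all hypothetical and are not taken from the paper.

```python
import sqlite3

# Hypothetical toy schema: names and rows are illustrative assumptions,
# not the schema or data used in the benchmark.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variant (
    variant_id INTEGER PRIMARY KEY, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT
);
CREATE TABLE genotype (sample_id INTEGER, variant_id INTEGER);
""")
conn.executemany(
    "INSERT INTO variant VALUES (?, ?, ?, ?, ?)",
    [(1, "1", 100, "A", "G"), (2, "1", 250, "C", "T"), (3, "2", 100, "G", "A")],
)
conn.executemany(
    "INSERT INTO genotype VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 1), (10, 2), (11, 3)],
)


def range_query(chrom, start, end):
    """Simple genome range query: variants on one chromosome in [start, end]."""
    return conn.execute(
        "SELECT variant_id, pos, ref, alt FROM variant "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    ).fetchall()


def carriers_with_join(chrom, start, end):
    """Carrier counts on the normalized schema: a join runs at query time."""
    return conn.execute(
        "SELECT v.variant_id, COUNT(*) FROM variant v "
        "JOIN genotype g ON g.variant_id = v.variant_id "
        "WHERE v.chrom = ? AND v.pos BETWEEN ? AND ? "
        "GROUP BY v.variant_id ORDER BY v.variant_id",
        (chrom, start, end),
    ).fetchall()


# Pre-aggregated, denormalized table: the join is paid once, at load time.
conn.execute("""
CREATE TABLE variant_counts AS
SELECT v.variant_id, v.chrom, v.pos, COUNT(g.sample_id) AS n_carriers
FROM variant v LEFT JOIN genotype g ON g.variant_id = v.variant_id
GROUP BY v.variant_id, v.chrom, v.pos
""")


def carriers_preaggregated(chrom, start, end):
    """Same answer as carriers_with_join, read from one table without a join."""
    return conn.execute(
        "SELECT variant_id, n_carriers FROM variant_counts "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY variant_id",
        (chrom, start, end),
    ).fetchall()
```

In this sketch both carrier-count functions return the same rows; the pre-aggregated variant simply moves the join from query time to load time, which is the trade-off the abstract credits for the reduction in distributed join operations.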

Bibliographic details

Journal: Database: The Journal of Biological Databases and Curation
Authors: Marek S. Wiewiórka; Dawid P. Wysakowicz; Michał J. Okoniewski; Tomasz Gambin
Year (volume), issue: 2017 (2017), -1
Year: 2017
Pages: bax049
Total pages: 16
Original format: PDF
Chinese Library Classification: Biology

Similar literature

1. Hengam a MapReduce-Based Distributed Data Warehouse for Big Data: A MapReduce-Based Distributed Data Warehouse for Big Data [J]. Mohammadhossein Barkhordari, Mahdi Niamanesh. International Journal of Artificial Life Research. 2018, No. 1
2. Building a Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Warehouse [J]. Tian Yuanyuan, Ozcan Fatma, Zou Tao. ACM Transactions on Database Systems. 2016, No. 4
3. CO2 and O2 solubility and diffusivity data in food products stored in data warehouse structured by ontology [J]. Valérie Guillard, Patrice Buche, Juliette Dibie. Data in Brief. 2016, No. 1
4. S2D: Shared Distributed Datasets, Storing Shared Data for Multiple and Massive Queries Optimization in a Distributed Data Warehouse [C]. Rado Ratsimbazafy, Omar Boussaid, Fadila Bentayeb. International Conference on Big Data Analytics and Knowledge Discovery. 2017
5. A data warehouse solution: A fund-raising data warehouse [D]. Bei, Joyce Yuan. 2010
6. CO2 and O2 solubility and diffusivity data in food products stored in data warehouse structured by ontology [O]. Valérie Guillard, Patrice Buche, Juliette Dibie. 2016
7. Benchmark for OLAP on NoSQL technologies comparing NoSQL multidimensional data warehousing solutions [O]. Max Chevalier, Mohammed El Malki, Arlind Kopliku. 2015
