Database: The Journal of Biological Databases and Curation

Benchmarking distributed data warehouse solutions for storing genomic variant information

Abstract

Genomic-based personalized medicine encompasses storing, analysing and interpreting genomic variants as its central issues. At a time when thousands of patients' sequenced exomes and genomes are becoming available, there is a growing need for efficient database storage and querying. The answer could be the application of modern distributed storage systems and query engines. However, the application of large genomic variant databases to this problem has not been sufficiently explored in the literature. To investigate the effectiveness of modern columnar storage [column-oriented Database Management System (DBMS)] and query engines, we have developed a prototype genomic variant data warehouse, populated with a large volume of generated genomic variant and phenotypic data. Next, we have benchmarked the performance of a number of combinations of distributed storage systems and query engines on a set of SQL queries that address biological questions essential for both research and medical applications. In addition, a non-distributed analytical database (MonetDB) has been used as a baseline. Comparison of query execution times confirms that distributed data warehousing solutions outperform classic relational DBMSs. Moreover, pre-aggregation and further denormalization of data, which reduce the number of distributed join operations, improve query performance by several orders of magnitude. Most distributed back-ends offer good performance for complex analytical queries, while the Optimized Row Columnar (ORC) format paired with Presto, and Parquet paired with Spark 2, provide, on average, the lowest execution times. Apache Kudu, on the other hand, is the only solution that guarantees sub-second performance for simple genome range queries returning a small subset of data, where a low-latency response is expected, while still offering decent performance for analytical queries. In summary, research and clinical applications that require the storage and analysis of variants from thousands of samples can benefit from the scalability and performance of distributed data warehouse solutions.

Database URL:
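
The two query patterns named in the abstract, a simple genome range query and an analytical query that can be served either through a join or from a pre-aggregated table, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the table names (`variant`, `sample`, `variant_by_phenotype`), the columns, and the toy rows are hypothetical and are not taken from the paper's schema, and SQLite is used only as a convenient single-node stand-in, not as one of the benchmarked engines.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical variant schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, phenotype TEXT);
        CREATE TABLE variant (
            sample_id INTEGER, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?)",
        [(1, "case"), (2, "case"), (3, "control")],
    )
    conn.executemany(
        "INSERT INTO variant VALUES (?, ?, ?, ?, ?)",
        [
            (1, "1", 100, "A", "G"),
            (2, "1", 100, "A", "G"),
            (3, "1", 100, "A", "G"),
            (1, "1", 250, "C", "T"),
            (3, "2", 100, "G", "A"),
        ],
    )
    # Pre-aggregation: resolve the variant/sample join once, at load time,
    # and store per-phenotype carrier counts in a denormalized table.
    conn.execute(
        """
        CREATE TABLE variant_by_phenotype AS
        SELECT v.chrom, v.pos, v.ref, v.alt, s.phenotype,
               COUNT(*) AS n_samples
        FROM variant v JOIN sample s ON s.sample_id = v.sample_id
        GROUP BY v.chrom, v.pos, v.ref, v.alt, s.phenotype
        """
    )
    return conn


def range_query(conn, chrom, start, end):
    """Simple genome range query: all variant calls in [start, end] on chrom."""
    return conn.execute(
        "SELECT sample_id, pos, ref, alt FROM variant "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? "
        "ORDER BY pos, sample_id",
        (chrom, start, end),
    ).fetchall()


def counts_with_join(conn, phenotype):
    """Analytical query that joins variants to samples at query time."""
    return conn.execute(
        "SELECT v.chrom, v.pos, v.ref, v.alt, COUNT(*) "
        "FROM variant v JOIN sample s ON s.sample_id = v.sample_id "
        "WHERE s.phenotype = ? "
        "GROUP BY v.chrom, v.pos, v.ref, v.alt "
        "ORDER BY v.chrom, v.pos",
        (phenotype,),
    ).fetchall()


def counts_preaggregated(conn, phenotype):
    """Same question answered from the pre-aggregated table, with no join."""
    return conn.execute(
        "SELECT chrom, pos, ref, alt, n_samples FROM variant_by_phenotype "
        "WHERE phenotype = ? ORDER BY chrom, pos",
        (phenotype,),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(range_query(conn, "1", 50, 150))
    print(counts_with_join(conn, "case"))
    print(counts_preaggregated(conn, "case"))
```

Both counting functions return the same rows. The difference is where the join cost is paid: at query time in the first case, once at load time in the second. In a distributed engine that join typically involves moving data between nodes, which is why the abstract attributes large speed-ups to pre-aggregation and denormalization.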
