Optimizing Multiple Queries on Scientific Datasets with Partial Replicas


Abstract

We propose strategies to efficiently execute a query workload, consisting of multiple related queries submitted against a scientific dataset, on a distributed-memory system in the presence of partial dataset replicas. Partial replication reorganizes and redistributes one or more subsets of a dataset across the storage system to reduce I/O overheads and increase I/O parallelism. Our work targets a class of queries, called range queries, in which the query predicate specifies lower and upper bounds on the values of all or a subset of the attributes of a dataset. Data elements whose attribute values fall within the specified bounds are retrieved from the dataset. If we think of the attributes of a dataset as forming a multi-dimensional space, where each attribute corresponds to one dimension, a range query defines a bounding box in this space. We evaluate our strategies in two scenarios involving range queries. In the first scenario, queries have overlapping regions of interest, such as those arising from an exploratory analysis of the dataset by multiple users. In the second scenario, queries represent adjacent rectilinear sections that together capture an irregular subregion of the multi-dimensional space. This scenario corresponds to a case where the user wants to query and retrieve a spatial feature from the dataset. We propose cost models and an algorithm for optimizing such queries. Our results, using queries for subsetting and analysis of medical image datasets, show that effective use of partial replicas can reduce query execution times.
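The bounding-box view of a range query described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation or cost model; the names `RangeQuery`, `matches`, `overlap`, and `select` are hypothetical, bounds are assumed to be inclusive, and an attribute that a query leaves out is assumed to be unconstrained.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# attribute name -> (lower bound, upper bound); bounds are assumed inclusive
Bounds = Dict[str, Tuple[float, float]]


@dataclass
class RangeQuery:
    """Hypothetical range query: a bounding box over a subset of attributes."""
    bounds: Bounds

    def matches(self, element: Dict[str, float]) -> bool:
        # An element is selected when every constrained attribute lies in its bounds.
        return all(lo <= element[attr] <= hi for attr, (lo, hi) in self.bounds.items())

    def overlap(self, other: "RangeQuery") -> Optional["RangeQuery"]:
        # Intersect the two boxes dimension by dimension.
        # An attribute constrained by only one query keeps that query's bounds.
        merged: Bounds = dict(self.bounds)
        for attr, (lo, hi) in other.bounds.items():
            if attr in merged:
                lo = max(lo, merged[attr][0])
                hi = min(hi, merged[attr][1])
                if lo > hi:
                    return None  # disjoint in this dimension, so no shared region
            merged[attr] = (lo, hi)
        return RangeQuery(merged)


def select(elements: List[Dict[str, float]], query: RangeQuery) -> List[Dict[str, float]]:
    """Return the data elements that fall inside the query's bounding box."""
    return [e for e in elements if query.matches(e)]


elements = [{"x": 1, "y": 1}, {"x": 5, "y": 5}, {"x": 9, "y": 2}]
q1 = RangeQuery({"x": (0, 6), "y": (0, 6)})
q2 = RangeQuery({"x": (4, 10)})
shared = q1.overlap(q2)
```

Here `shared` describes the region requested by both queries, which is the kind of overlap the first evaluation scenario refers to.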

Bibliographic details

Source: IEEE/ACM International Conference on Grid Computer, 2007, 8 pages
Authors: Li Weng; Umit Catalyurek; Tahsin Kurc; Gagan Agrawal; Joel Saltz
Original format: PDF
Chinese Library Classification: TP3-53
