Source: IEEE/ACM International Conference on Grid Computing
Optimizing Multiple Queries on Scientific Datasets with Partial Replicas

Abstract

We propose strategies to efficiently execute a query workload, consisting of multiple related queries submitted against a scientific dataset, on a distributed-memory system in the presence of partial dataset replicas. Partial replication re-organizes and re-distributes one or more subsets of a dataset across the storage system to reduce I/O overheads and increase I/O parallelism. Our work targets a class of queries, called range queries, in which the query predicate specifies lower and upper bounds on the values of all, or a subset, of the attributes of a dataset. Data elements whose attribute values fall within the specified bounds are retrieved from the dataset. If we think of the attributes of a dataset as forming a multi-dimensional space, where each attribute corresponds to one dimension, a range query defines a bounding box in this space. We evaluate our strategies in two scenarios involving range queries. The first scenario represents the case in which queries have overlapping regions of interest, such as those arising from an exploratory analysis of the dataset by multiple users. In the second scenario, queries represent adjacent rectilinear sections that together capture an irregular subregion of the multi-dimensional space. This scenario corresponds to a case where the user wants to query and retrieve a spatial feature from the dataset. We propose cost models and an algorithm for optimizing such queries. Our results, using queries for subsetting and analysis of medical image datasets, show that effective use of partial replicas can reduce query execution times.
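To make the bounding-box view of a range query concrete, the following is a minimal Python sketch. It is an illustration only and not the authors' implementation: the `RangeQuery` class, its `matches` and `intersect` methods, the attribute names `x` and `y`, and the sample data are all hypothetical. It shows how a predicate with lower and upper bounds selects data elements, and how the overlapping region of two queries (as in the first scenario) is itself a bounding box.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical representation: attribute name -> (lower, upper), both inclusive.
Bounds = Dict[str, Tuple[float, float]]


@dataclass
class RangeQuery:
    """A range query as a bounding box over a subset of attributes."""
    bounds: Bounds

    def matches(self, element: Dict[str, float]) -> bool:
        # An element is selected if every bounded attribute lies in its range.
        # Attributes not named in the query are unconstrained.
        return all(lo <= element[attr] <= hi
                   for attr, (lo, hi) in self.bounds.items())

    def intersect(self, other: "RangeQuery") -> Optional["RangeQuery"]:
        # The overlap of two boxes is a box; None means the boxes are disjoint.
        merged: Bounds = dict(self.bounds)
        for attr, (lo, hi) in other.bounds.items():
            if attr in merged:
                cur_lo, cur_hi = merged[attr]
                lo, hi = max(lo, cur_lo), min(hi, cur_hi)
                if lo > hi:
                    return None
            merged[attr] = (lo, hi)
        return RangeQuery(merged)


# Toy dataset with two attributes (assumed names, not from the paper).
elements = [{"x": 1, "y": 1}, {"x": 5, "y": 5}, {"x": 9, "y": 2}]

q1 = RangeQuery({"x": (0, 6), "y": (0, 6)})
q2 = RangeQuery({"x": (4, 10), "y": (1, 8)})

selected_q1 = [e for e in elements if q1.matches(e)]
selected_q2 = [e for e in elements if q2.matches(e)]
overlap = q1.intersect(q2)
shared = [e for e in elements if overlap is not None and overlap.matches(e)]
```

In this toy example the two queries share the region with `x` in [4, 6] and `y` in [1, 6]. Elements in such a shared region are natural candidates for being read once and reused across queries, although the cost models and algorithm that decide this in the paper are not reproduced here.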