首页> 外文会议>ACM SIGMOD International Conference on Management of Data >Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse
【24h】

Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse

机译:蛇和三明治:数据仓库的最佳聚类策略

获取原文

摘要

Physical layout of data is a crucial determinant of performance in a data warehouse. The optimal clustering of data on disk, for minimizing expected I/O, depends on the query workload. In practice, we often have a reasonable sense of the likelihood of different classes of queries, e.g., 40% of the queries concern calls made from some specific telephone number in some month. In this paper, we address the problem of finding an optimal clustering of records of a fact table on disk, given an expected workload in the form of a probability distribution over query classes. Attributes in a data warehouse fact table typically have hierarchies defined on them (by means of auxiliary dimension tables). The product of the dimensional hierarchy levels forms a lattice and leads to a natural notion of query classes. Optimal clustering in this context is a combinatorially explosive problem with a huge search space (doubly exponential in number of hierarchy levels). We identify an important subclass of clustering strategies called lattice paths, and present a dynamic programming algorithm for finding the optimal lattice path clustering, in time linear in the lattice size. We additionally propose a technique called snaking, which when applied to a lattice path, always reduces its cost. For a representative class of star schemas, we show that for every workload, there is a snaked lattice path which is globally optimal. Further, we prove that the clustering obtained by applying snaking to the optimal lattice path is never much worse than the globally optimal snaked lattice path clustering. We complement our analyses and validate the practical utility of our techniques with experiments using TPC-D benchmark data.
机译:数据的物理布局是数据仓库中性能的重要决定因素。磁盘上的最佳聚类,用于最小化预期的I / O取决于查询工作负载。在实践中,我们常常具有不同类别的可能性的合理意义,例如,40%的查询涉及某些特定电话号码的呼叫。在本文中,给出了在查询类上的概率分布形式的预期工作量的预期工作负载给出了查找磁盘上的事实表的最佳聚类的问题。数据仓库事实表中的属性通常具有定义在它们上的层次结构(通过辅助维度表)。尺寸层次级别的乘积形成晶格,并导致查询类的自然概念。在此上下文中的最佳聚类是一个具有庞大的搜索空间的组合爆炸性问题(层次级别的双重指数)。我们确定了一个名为晶格路径的聚类策略的重要子类,并呈现了一种动态编程算法,用于在晶格大小的时间线性中找到最佳晶格路径聚类。我们还提出了一种称为Snaking的技术,当应用于晶格路径时,总是降低其成本。对于代表性的星级模式,我们展示了为每个工作负载,有一个全球最佳的恐怖晶格路径。此外,我们证明通过向最佳格子路径施加速度而获得的聚类永远不会比全球最佳的蜿蜒晶格路径聚类更糟糕。我们补充了我们的分析,并使用TPC-D基准数据进行实验验证了我们技术的实用实用性。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号