【24h】

Snakes and sandwiches

机译:蛇和三明治

获取原文

摘要

Physical layout of data is a crucial determinant of performance in a data warehouse. The optimal clustering of data on disk, for minimizing expected I/O, depends on the query workload. In practice, we often have a reasonable sense of the likelihood of different classes of queries, e.g., 40% of the queries concern calls made from some specific telephone number in some month. In this paper, we address the problem of finding an optimal clustering of records of a fact table on disk, given an expected workload in the form of a probability distribution over query classes.

Attributes in a data warehouse fact table typically have hierarchies defined on them (by means of auxiliary dimension tables). The product of the dimensional hierarchy levels forms a lattice and leads to a natural notion of query classes. Optimal clustering in this context is a combinatorially explosive problem with a huge search space (doubly exponential in number of hierarchy levels). We identify an important subclass of clustering strategies called lattice paths, and present a dynamic programming algorithm for finding the optimal lattice path clustering, in time linear in the lattice size. We additionally propose a technique called snaking, which when applied to a lattice path, always reduces its cost. For a representative class of star schemas, we show that for every workload, there is a snaked lattice path which is globally optimal. Further, we prove that the clustering obtained by applying snaking to the optimal lattice path is never much worse than the globally optimal snaked lattice path clustering. We complement our analyses and validate the practical utility of our techniques with experiments using TPC-D benchmark data.

机译:

数据的物理布局是数据仓库中性能的关键决定因素。磁盘上数据的最佳集群以最大程度地减少预期的I / O取决于查询工作负载。在实践中,我们通常对不同类别的查询的可能性有一个合理的认识,例如,40%的查询涉及在 some 个月中从 some 特定电话号码发出的呼叫。在本文中,我们解决了在给定的预期工作量(以查询类的概率分布的形式)的情况下,在磁盘上找到事实表的记录的最佳聚类的问题。

数据仓库事实表中的属性通常具有在其上定义的层次结构(通过辅助维表)。维层次结构层次的乘积形成一个格并导致查询类的自然概念。在这种情况下,最佳聚类是一个组合爆炸性问题,具有巨大的搜索空间(层次结构级别的数量成倍增加)。我们确定了称为晶格路径的聚类策略的重要子类,并提出了一种动态规划算法,用于寻找最优的晶格路径聚类,并且在晶格大小上呈线性关系。我们还提出了一种称为 snaking 的技术,该技术在应用于晶格路径时始终可以降低其成本。对于星型模式的代表性类,我们表明对于每个工作负载,都有一条蛇形的晶格路径,该路径是全局最优的。此外,我们证明了通过对最佳晶格路径应用蛇形获得的聚类永远不会比全局最佳蛇形晶格路径聚类差很多。我们对分析进行了补充,并使用TPC-D基准数据进行了实验,验证了我们技术的实用性。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号