Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse

机译：蛇和三明治：数据仓库的最佳聚类策略

获取原文

页面导航

摘要
著录项
相似文献
相关主题

摘要

Physical layout of data is a crucial determinant of performance in a data warehouse. The optimal clustering of data on disk, for minimizing expected I/O, depends on the query workload. In practice, we often have a reasonable sense of the likelihood of different classes of queries, e.g., 40% of the queries concern calls made from some specific telephone number in some month. In this paper, we address the problem of finding an optimal clustering of records of a fact table on disk, given an expected workload in the form of a probability distribution over query classes. Attributes in a data warehouse fact table typically have hierarchies defined on them (by means of auxiliary dimension tables). The product of the dimensional hierarchy levels forms a lattice and leads to a natural notion of query classes. Optimal clustering in this context is a combinatorially explosive problem with a huge search space (doubly exponential in number of hierarchy levels). We identify an important subclass of clustering strategies called lattice paths, and present a dynamic programming algorithm for finding the optimal lattice path clustering, in time linear in the lattice size. We additionally propose a technique called snaking, which when applied to a lattice path, always reduces its cost. For a representative class of star schemas, we show that for every workload, there is a snaked lattice path which is globally optimal. Further, we prove that the clustering obtained by applying snaking to the optimal lattice path is never much worse than the globally optimal snaked lattice path clustering. We complement our analyses and validate the practical utility of our techniques with experiments using TPC-D benchmark data.

机译：数据的物理布局是数据仓库中性能的重要决定因素。磁盘上的最佳聚类，用于最小化预期的I / O取决于查询工作负载。在实践中，我们常常具有不同类别的可能性的合理意义，例如，40％的查询涉及某些特定电话号码的呼叫。在本文中，给出了在查询类上的概率分布形式的预期工作量的预期工作负载给出了查找磁盘上的事实表的最佳聚类的问题。数据仓库事实表中的属性通常具有定义在它们上的层次结构（通过辅助维度表）。尺寸层次级别的乘积形成晶格，并导致查询类的自然概念。在此上下文中的最佳聚类是一个具有庞大的搜索空间的组合爆炸性问题（层次级别的双重指数）。我们确定了一个名为晶格路径的聚类策略的重要子类，并呈现了一种动态编程算法，用于在晶格大小的时间线性中找到最佳晶格路径聚类。我们还提出了一种称为Snaking的技术，当应用于晶格路径时，总是降低其成本。对于代表性的星级模式，我们展示了为每个工作负载，有一个全球最佳的恐怖晶格路径。此外，我们证明通过向最佳格子路径施加速度而获得的聚类永远不会比全球最佳的蜿蜒晶格路径聚类更糟糕。我们补充了我们的分析，并使用TPC-D基准数据进行实验验证了我们技术的实用实用性。

著录项

来源
《ACM SIGMOD International Conference on Management of Data》|1999年||共12页
会议地点
作者
H. V. Jagadish; Laks V. S. Lakshmanan; Divesh Srivastava; Divesh Srivastava;
展开▼
作者单位

展开▼
会议组织
原文格式 PDF
正文语种
中图分类 TP3-532;
关键词

相似文献

外文文献
中文文献
专利

1. An Optimal Data Placement Strategy for Improving System Performance of Massive Data Applications Using Graph Clustering [J] . S. R. Balasundaram, S. Vengadeswaran International journal of ambient computing and intelligence . 2018,第3期

机译：使用图聚类提高海量数据应用系统性能的最佳数据放置策略
2. Systematic analysis of snake neurotoxins' functional classification using a data warehousing approach. [J] . Siew JP, Khan AM, Tan PT, Bioinformatics . 2004,第18期

机译：使用数据仓库方法对蛇神经毒素的功能分类进行系统分析。
3. Systematic analysis of snake neurotoxins' functional classification using a data warehousing approach. [J] . Siew JP, Khan AM, Tan PT, Bioinformatics . 2004,第18期

机译：使用数据仓库方法对蛇神经毒素的功能分类进行系统分析。
4. Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse [C] . H. V. Jagadish, Laks V. S. Lakshmanan, Divesh Srivastava, ACM SIGMOD International Conference on Management of Data . 1999

机译：蛇和三明治：数据仓库的最佳聚类策略
5. Data warehousing for social networking and human mobility research and detecting real-world events in space and time using feature clustering . [D] . Pawling, Alec. 2011

机译：用于社交网络和人员流动性研究的数据仓库，以及使用特征聚类检测时空中的现实事件。
6. The Optimally Designed Variational Autoencoder Networks for Clustering and Recovery of Incomplete Multimedia Data [O] . Xiulan Yu, Hongyu Li, Zufan Zhang, 2019

机译：针对不完整多媒体数据的聚类和恢复的优化设计的变分自动编码器网络
7. Strategy and methodology for enterprise data warehouse development. Integrating data mining and social networking techniques for identifying different communities within the data warehouse. [O] . Rifaie Mohammad 2010

机译：企业数据仓库开发的策略和方法。集成了数据挖掘和社交网络技术，以识别数据仓库内的不同社区。

Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse

摘要

著录项

相似文献

相关主题

期刊订阅