Journal: The VLDB Journal

Hierarchical clustering for OLAP: the CUBE File approach


Abstract

This paper deals with the problem of physically clustering, on disk and in a hierarchy-preserving manner, multidimensional data that are organized in hierarchies. This is called hierarchical clustering. A typical case where hierarchical clustering is necessary for reducing I/Os during query evaluation is the most detailed data of an OLAP cube. The presence of hierarchies in the multidimensional space results in an enormous search space for this problem. We propose a representation of the data space that results in a chunk-tree representation of the cube. The model adapts to the cube's extensive sparseness and provides efficient access to subsets of data based on hierarchy value combinations. Based on this representation of the search space, we formulate the problem as a chunk-to-bucket allocation problem, which is a packing problem, as opposed to the linear ordering approach followed in the literature. We propose a metric to evaluate the quality of the hierarchical clustering achieved (i.e., to evaluate the solutions to the problem) and formulate the problem as an optimization problem. We prove its NP-hardness and provide an effective solution based on a linear-time greedy algorithm. The solution of this problem leads to the construction of the CUBE File data structure. We analyze in depth all steps of the construction and provide solutions for the interesting sub-problems that arise, such as the formation of bucket-regions, the storage of large data chunks, and the caching of the upper nodes (root directory) in main memory. Finally, we provide an extensive experimental evaluation of the CUBE File's adaptability to the sparseness of the data space as well as to an increasing number of data points. The main result is that the CUBE File is highly adaptive to even the sparsest data spaces and, for realistic data point cardinalities, provides hierarchical clustering of high quality and significant space savings.
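
To make the chunk-to-bucket packing idea concrete, here is a minimal Python sketch. It is not the paper's algorithm, which the abstract does not spell out. It assumes a simple top-down greedy rule: if a whole subtree of the chunk-tree fits into one fixed-size bucket, it is stored there together; otherwise the node is kept as a directory node and its children are considered in turn. The names `Chunk`, `allocate`, and `capacity`, and the size units, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A node of a (hypothetical) chunk-tree; size is in arbitrary storage units."""
    name: str
    size: int
    children: list["Chunk"] = field(default_factory=list)

    def subtree_size(self) -> int:
        # Storage needed for this chunk plus all of its descendants.
        return self.size + sum(c.subtree_size() for c in self.children)

    def subtree_names(self) -> list[str]:
        # Pre-order listing of the chunks in this subtree.
        names = [self.name]
        for c in self.children:
            names.extend(c.subtree_names())
        return names


def allocate(root: Chunk, capacity: int) -> tuple[list[list[str]], list[str]]:
    """Greedy top-down packing (illustrative assumption, not the paper's method).

    Returns (buckets, directory): each bucket holds one whole subtree that
    fits within `capacity`; `directory` lists the upper nodes that did not fit.
    """
    buckets: list[list[str]] = []
    directory: list[str] = []

    def visit(node: Chunk) -> None:
        if node.subtree_size() <= capacity:
            # The whole subtree fits: keep it together in one bucket.
            buckets.append(node.subtree_names())
        elif not node.children:
            # Oversized leaf chunks need special treatment, which this sketch omits.
            raise ValueError(f"leaf chunk {node.name!r} exceeds bucket capacity")
        else:
            directory.append(node.name)
            for child in node.children:
                visit(child)

    visit(root)
    return buckets, directory


# Example with made-up sizes and a bucket capacity of 8 units.
root = Chunk("root", 1, [
    Chunk("A", 1, [Chunk("A1", 3), Chunk("A2", 2)]),
    Chunk("B", 1, [Chunk("B1", 6), Chunk("B2", 5)]),
])
buckets, directory = allocate(root, capacity=8)
print(buckets)    # [['A', 'A1', 'A2'], ['B1'], ['B2']]
print(directory)  # ['root', 'B']
```

In this toy run, subtree A fits in a single bucket, so a query restricted to A touches one bucket, while B is split and stays in the directory. The sketch recomputes subtree sizes on every visit and does not merge small sibling subtrees into shared buckets, so it makes no claim about running time or about the clustering quality metric proposed in the paper.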
