首页> 外文学位 >Data reduction in data warehouses.
【24h】

Data reduction in data warehouses.

机译:减少数据仓库中的数据。

获取原文
获取原文并翻译 | 示例

摘要

Much research has been devoted to the efficient computation of relational aggregations. In this paper we consider the inverse problem, that of deriving (approximately) the original data from the aggregates.; We motivate this problem in the context of two specific application areas, approximate query answering and data analysis. We propose a framework based on the notion of information entropy that enables us to estimate the original values in a data set, given only aggregated information about it. We then show how approximate queries on the data from which the aggregates were derived can be performed using our framework. We also describe an alternate use of the proposed framework that enables us to identify values that deviate from the underlying data distribution, suitable for data ruining purposes.; We present a detailed performance study of the algorithms using both real and synthetic data, highlighting the benefits of our approach as well as the efficiency of the proposed solutions.; Subsequently, we consider the above problem in a space constrained environment, where only a subset of the required aggregates can be stored. More specifically, we wish to select a set of aggregates of interest, subject to a constraint on the total space occupied by these aggregates. The objective is to maximize the total benefit, which is a function of the number and importance of the queries that we can estimate given the selected aggregates.; For this problem, which as we show is NP-hard, there are no known polynomial approximation schemes, and we propose several algorithms for solving it. We explore the use of greedy and randomized techniques as well as clustering based approaches. The solutions presented herein are generic and can be applied to other problem domains as well.; We explore the properties and special characteristics of the above techniques with an experimental evaluation. Our results illustrate the behavior of the algorithms under different settings, and highlight the benefits of each approach. Based on our analysis, we present worst case scenarios for the algorithms. This offers insight into the operation of the algorithms, and provides a practical guide for selecting among the proposed techniques.
机译:许多研究致力于关系聚合的有效计算。在本文中,我们考虑了一个反问题,即从集合中导出(近似)原始数据的问题。我们在两个特定的应用领域(近似查询回答和数据分析)中激发了这个问题。我们提出了一个基于信息熵概念的框架,该框架使我们能够估计数据集中的原始值(仅给出有关它的汇总信息)。然后,我们展示了如何使用我们的框架对汇总数据的近似查询。我们还描述了所提议框架的另一种用法,该框架使我们能够识别与基础数据分布不同的值,这些值适用于数据破坏的目的。我们使用真实数据和合成数据对算法进行了详细的性能研究,突出了我们方法的优势以及所提出解决方案的效率。随后,我们在空间受限的环境中考虑上述问题,在该环境中,只能存储所需聚合的一个子集。更具体地说,我们希望选择一组感兴趣的集合,但要限制这些集合所占用的总空间。目的是最大程度地提高总收益,这取决于我们可以根据给定的汇总估算的查询数量和重要性。对于这个问题,如我们所示,它是NP难的,没有已知的多项式逼近方案,我们提出了几种解决方案。我们探索使用贪婪和随机技术以及基于聚类的方法。这里介绍的解决方案是通用的,也可以应用于其他问题领域。我们通过实验评估来探索上述技术的特性和特殊特性。我们的结果说明了在不同设置下算法的行为,并强调了每种方法的好处。根据我们的分析,我们给出了算法的最坏情况。这提供了对算法操作的深入了解,并为在建议的技术之间进行选择提供了实用指南。

著录项

  • 作者

    Palpanas, Themistoklis.;

  • 作者单位

    University of Toronto (Canada).;

  • 授予单位 University of Toronto (Canada).;
  • 学科 Computer Science.
  • 学位 Ph.D.
  • 年度 2003
  • 页码 131 p.
  • 总页数 131
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类 自动化技术、计算机技术;
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号