Data reduction in data warehouses.

机译：减少数据仓库中的数据。

获取原文

获取原文并翻译 | 示例

页面导航

摘要
著录项
相似文献
相关主题

摘要

Much research has been devoted to the efficient computation of relational aggregations. In this paper we consider the inverse problem, that of deriving (approximately) the original data from the aggregates.; We motivate this problem in the context of two specific application areas, approximate query answering and data analysis. We propose a framework based on the notion of information entropy that enables us to estimate the original values in a data set, given only aggregated information about it. We then show how approximate queries on the data from which the aggregates were derived can be performed using our framework. We also describe an alternate use of the proposed framework that enables us to identify values that deviate from the underlying data distribution, suitable for data ruining purposes.; We present a detailed performance study of the algorithms using both real and synthetic data, highlighting the benefits of our approach as well as the efficiency of the proposed solutions.; Subsequently, we consider the above problem in a space constrained environment, where only a subset of the required aggregates can be stored. More specifically, we wish to select a set of aggregates of interest, subject to a constraint on the total space occupied by these aggregates. The objective is to maximize the total benefit, which is a function of the number and importance of the queries that we can estimate given the selected aggregates.; For this problem, which as we show is NP-hard, there are no known polynomial approximation schemes, and we propose several algorithms for solving it. We explore the use of greedy and randomized techniques as well as clustering based approaches. The solutions presented herein are generic and can be applied to other problem domains as well.; We explore the properties and special characteristics of the above techniques with an experimental evaluation. Our results illustrate the behavior of the algorithms under different settings, and highlight the benefits of each approach. Based on our analysis, we present worst case scenarios for the algorithms. This offers insight into the operation of the algorithms, and provides a practical guide for selecting among the proposed techniques.

机译：许多研究致力于关系聚合的有效计算。在本文中，我们考虑了一个反问题，即从集合中导出（近似）原始数据的问题。我们在两个特定的应用领域（近似查询回答和数据分析）中激发了这个问题。我们提出了一个基于信息熵概念的框架，该框架使我们能够估计数据集中的原始值（仅给出有关它的汇总信息）。然后，我们展示了如何使用我们的框架对汇总数据的近似查询。我们还描述了所提议框架的另一种用法，该框架使我们能够识别与基础数据分布不同的值，这些值适用于数据破坏的目的。我们使用真实数据和合成数据对算法进行了详细的性能研究，突出了我们方法的优势以及所提出解决方案的效率。随后，我们在空间受限的环境中考虑上述问题，在该环境中，只能存储所需聚合的一个子集。更具体地说，我们希望选择一组感兴趣的集合，但要限制这些集合所占用的总空间。目的是最大程度地提高总收益，这取决于我们可以根据给定的汇总估算的查询数量和重要性。对于这个问题，如我们所示，它是NP难的，没有已知的多项式逼近方案，我们提出了几种解决方案。我们探索使用贪婪和随机技术以及基于聚类的方法。这里介绍的解决方案是通用的，也可以应用于其他问题领域。我们通过实验评估来探索上述技术的特性和特殊特性。我们的结果说明了在不同设置下算法的行为，并强调了每种方法的好处。根据我们的分析，我们给出了算法的最坏情况。这提供了对算法操作的深入了解，并为在建议的技术之间进行选择提供了实用指南。

著录项

作者
Palpanas, Themistoklis.;
展开▼
作者单位

University of Toronto (Canada).;

展开▼
授予单位 University of Toronto (Canada).;
学科 Computer Science.
学位 Ph.D.
年度 2003
页码 131 p.
总页数 131
原文格式 PDF
正文语种 eng
中图分类自动化技术、计算机技术;
关键词

相似文献

外文文献
中文文献
专利

1. Implementation of data access and use procedures in clinical data warehouses. A systematic review of literature and publicly available policies [J] . Elena Pavlenko, Daniel Strech, Holger Langhof BMC Medical Informatics and Decision Making . 2020,第1期

机译：在临床数据仓库中实现数据访问和使用过程。对文献和公开可用政策的系统审查
2. Improving practice: establishing processes to query clinical data warehouses. [J] . Brokel JM Urologic nursing: official journal of the American Urological Association Allied . 2009,第5期

机译：改进实践：建立查询临床数据仓库的过程。
3. A novel data reduction method: Distance based data reduction and its application to classification of epileptiform EEG signals [J] . Polat K, Guenes S Applied mathematics and computation . 2008,第1期

机译：一种新颖的数据约简方法：基于距离的数据约简及其在癫痫样脑电信号分类中的应用
4. Less Reliable Page Error Reduction for 3D-TLC NAND Flash Memories with Data Overhead Reduction by 40 and Data-retention Time Increase by 5.0x [C] . Kyosuke Maeda, Kyoji Mizoguchi, Ken Takeuchi 2019 Silicon Nanoelectronics Workshop . 2019

机译：3D-TLC NAND闪存的页面错误减少可靠性降低，数据开销减少了40％，数据保留时间增加了5.0倍
5. Efficient data management in spatial data warehouses. [D] . Yu, Songmei. 2006

机译：空间数据仓库中的高效数据管理。
6. Implementation of data access and use procedures in clinical data warehouses. A systematic review of literature and publicly available policies [O] . Elena Pavlenko, Daniel Strech, Holger Langhof 2020

机译：在临床数据仓库中实现数据访问和使用过程。对文献和公开可用政策的系统审查
7. Implementation of data access and use procedures in clinical data warehouses. A systematic review of literature and publicly available policies [O] . Elena Pavlenko, Daniel Strech, Holger Langhof 2020

机译：在临床数据仓库中实现数据访问和使用过程。对文献和公开可用政策的系统审查
8. Voice Data Entry in NISTARS Warehouses. [R] . Stokes, J. M. 1993

机译：NIsTaRs仓库中的语音数据输入。

Data reduction in data warehouses.

摘要

著录项

相似文献

相关主题

期刊订阅