ACM SIGMOD International Conference on Management of Data

Information-theoretic tools for mining database structure from large data sets

Abstract

Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is often expressed as a set of constraints. In this work, we consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in an instance of data, an instance which may contain errors, missing values, and duplicate records. We propose a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design. We provide algorithms for creating these summaries over large, categorical data sets. We study the use of these summaries in one specific physical design task, that of ranking functional dependencies based on their data redundancy. We show how our ranking can be used by a physical data-design tool to find good vertical decompositions of a relation (decompositions that improve the information content of the design). We present an evaluation of the approach on real data sets.
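As a rough illustration of how information theory connects to functional dependencies, the sketch below computes the empirical conditional entropy H(Y | X) over a small categorical table. It relies on a standard fact: a dependency X → Y holds exactly in an instance if and only if H(Y | X) = 0 under the instance's empirical distribution. This is not the paper's algorithm or its ranking measure; the function names `entropy` and `conditional_entropy` and the sample rows are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import log2


def entropy(rows, attrs):
    """Empirical Shannon entropy (in bits) of rows projected onto attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def conditional_entropy(rows, lhs, rhs):
    """H(rhs | lhs), computed as H(lhs and rhs together) minus H(lhs)."""
    return entropy(rows, list(lhs) + list(rhs)) - entropy(rows, list(lhs))


# Hypothetical sample relation: dept determines city, but not emp.
rows = [
    {"emp": "ann", "dept": "sales", "city": "Oslo"},
    {"emp": "bob", "dept": "sales", "city": "Oslo"},
    {"emp": "cat", "dept": "ops", "city": "Rome"},
    {"emp": "dan", "dept": "ops", "city": "Rome"},
]

h_city_given_dept = conditional_entropy(rows, ["dept"], ["city"])
h_emp_given_dept = conditional_entropy(rows, ["dept"], ["emp"])
print(h_city_given_dept, h_emp_given_dept)
```

In this sample, H(city | dept) is zero because each department maps to a single city, while H(emp | dept) is one bit because each department has two equally frequent employees. On data with errors or missing values, a small but nonzero conditional entropy could be read as evidence of an approximate dependency, though any threshold would be an assumption.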
