Journal: Distributed and Parallel Databases

Incrementally updating unary inclusion dependencies in dynamic data


Abstract

Inclusion dependencies form one of the most fundamental classes of integrity constraints. Their importance in classical data management is reinforced by modern applications such as data profiling, data cleaning, entity resolution, and schema matching. Their discovery in an unknown dataset is at the core of any data-analysis effort. Several research approaches have therefore focused on discovering them efficiently in a given, static dataset. However, none of these approaches is suited to dynamic datasets, where discovery techniques should be able to efficiently update the inclusion dependencies after the dataset changes, without reprocessing the entire dataset. We present the first approach for incrementally updating unary inclusion dependencies. Our approach is based on the concept of attribute clustering, from which the unary inclusion dependencies can be derived efficiently. We incrementally update the clusters after each update of the dataset. Updating the clusters does not require access to the dataset, thanks to special data structures designed to support the updating process efficiently. We performed an exhaustive analysis of our approach by applying it to large datasets with several hundred attributes and more than 116.2 million tuples. The results show that incremental discovery significantly reduces the runtime compared with static discovery. The reduction in runtime is up to 99.9996% for both insertions and deletions.
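
To make the idea of attribute clustering concrete, the following Python sketch may help. It is a simplified illustration under our own assumptions, not the data structures or algorithm of the paper: the class name `UnaryINDIndex`, its methods, and the per-value occurrence counters are all hypothetical. Each distinct value is mapped to the set of attributes in which it occurs (its cluster), and a unary inclusion dependency A ⊆ B is reported when B appears in every cluster that contains A. Insertions and deletions touch only the counters, so the dependencies can be re-derived without rescanning the stored tuples.

```python
from collections import defaultdict


class UnaryINDIndex:
    """Illustrative sketch only; not the structures used in the paper."""

    def __init__(self, attributes):
        self.attributes = set(attributes)
        # value -> {attribute -> occurrences of that value in the attribute}
        self.counts = defaultdict(lambda: defaultdict(int))

    def insert(self, row):
        """Register one tuple, given as a dict {attribute: value}."""
        for attr, value in row.items():
            self.counts[value][attr] += 1

    def delete(self, row):
        """Remove one tuple; assumes the tuple was inserted earlier."""
        for attr, value in row.items():
            per_attr = self.counts[value]
            per_attr[attr] -= 1
            if per_attr[attr] == 0:
                del per_attr[attr]      # value no longer occurs in attr
            if not per_attr:
                del self.counts[value]  # value no longer occurs anywhere

    def clusters(self):
        """Distinct attribute sets that share at least one value."""
        return {frozenset(per_attr) for per_attr in self.counts.values()}

    def inds(self):
        """Return all pairs (A, B) with A != B such that A ⊆ B holds."""
        rhs = {a: self.attributes - {a} for a in self.attributes}
        for cluster in self.clusters():
            for a in cluster:
                # B survives only if it co-occurs with every value of A
                rhs[a] &= cluster
        return {(a, b) for a, bs in rhs.items() for b in bs}
```

In this sketch an attribute with no values is trivially included in every other attribute, and `inds()` recomputes the dependencies from the clusters on each call; an approach tuned for large datasets would presumably maintain the derived dependencies incrementally as well.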
