Published in: IEEE International Conference on Data Engineering

Detecting unique column combinations on dynamic data



Abstract

The discovery of all unique (and non-unique) column combinations in an unknown dataset is at the core of any data profiling effort. Unique column combinations resemble candidate keys of a relational dataset. Several research approaches have focused on their efficient discovery in a given, static dataset. However, none of these approaches is suitable for dynamic datasets, such as those found in transactional databases, social networks, and scientific applications. In these cases, data profiling techniques should be able to efficiently discover new uniques and non-uniques (and validate old ones) after tuple inserts or deletes, without re-profiling the entire dataset. We present Swan, the first approach to efficiently discover unique and non-unique constraints on dynamic datasets that is independent of the initial dataset size. In particular, Swan makes use of intelligently chosen indices to minimize access to old data. We perform an exhaustive analysis of Swan and compare it with two state-of-the-art techniques for unique discovery: Gordian and Ducc. The results show that Swan significantly outperforms both, as well as their incremental adaptations. For inserts, Swan is more than 63x faster than Gordian and up to 50x faster than Ducc. For deletes, Swan is more than 15x faster than Gordian and up to one order of magnitude faster than Ducc. In fact, Swan even improves on the static case by dividing the dataset into a static part and a set of inserts.
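
To make the idea of validating old uniques after inserts more concrete, here is a minimal Python sketch. It is not Swan's algorithm: the abstract does not describe how Swan chooses its indices, so the sketch simply assumes one hash index per known unique column combination. The class name `UniqueIndex` and its methods are hypothetical and exist only for illustration. The point it shows is that a batch of inserted tuples can be checked against the index alone, without rescanning the old rows.

```python
class UniqueIndex:
    """Hypothetical hash index over one column combination.

    Assumption: the combination was unique on the initial rows.
    This is an illustrative sketch, not the index structure used by Swan.
    """

    def __init__(self, columns, rows):
        self.columns = tuple(columns)
        # Store only the projected values, not the full old tuples.
        self.seen = {self._project(row) for row in rows}

    def _project(self, row):
        # Restrict a tuple to the indexed columns.
        return tuple(row[c] for c in self.columns)

    def still_unique_after(self, inserts):
        """Return True if the combination stays unique once `inserts` are added."""
        batch = set()
        for row in inserts:
            key = self._project(row)
            # A collision with old data or within the batch breaks uniqueness.
            if key in self.seen or key in batch:
                return False
            batch.add(key)
        return True

    def apply(self, inserts):
        # Fold an accepted batch into the index.
        self.seen.update(self._project(row) for row in inserts)


# Toy relation with three columns; columns (0,) and (1, 2) are unique here.
rows = [(1, "a", "x"), (2, "b", "x"), (3, "a", "y")]
by_id = UniqueIndex((0,), rows)
by_pair = UniqueIndex((1, 2), rows)

ok_batch = [(4, "b", "y")]
bad_batch = [(5, "a", "x")]  # repeats the pair ("a", "x") from the old data
```

With these toy rows, `by_pair.still_unique_after(ok_batch)` holds while `by_pair.still_unique_after(bad_batch)` does not, and neither check touches `rows` again. Discovering which new combinations become unique or non-unique after such a change, and handling deletes, is the harder part the paper addresses and is outside this sketch.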
