Conference: ACM SIGMOD International Conference on Management of Data

Sampling Dirty Data for Matching Attributes

Abstract

We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. However, real-world data sets are often "dirty", especially when data from different sources are integrated. To deal with this issue, we propose new similarity measures between sets of strings that consider not only set-based similarity but also similarity between string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that, for dirty data, our measures estimate value overlap more accurately than existing sample-based methods, but we also observe a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
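The abstract does not define the proposed measures, so the following is only a minimal sketch of the general idea it describes: comparing an exact set-based overlap score with a score that also credits near-matching strings. The function names (`jaccard`, `string_similarity`, `soft_containment`), the 0.8 threshold, the case and whitespace normalization, and the use of Python's `difflib.SequenceMatcher` as the string similarity are all assumptions made for illustration, not the paper's method.

```python
from difflib import SequenceMatcher


def jaccard(a, b):
    """Exact set-based overlap: |A ∩ B| / |A ∪ B| over distinct values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def string_similarity(s, t):
    """Similarity in [0, 1] between two strings.

    Assumption: case and surrounding whitespace are ignored, and
    difflib's ratio stands in for whatever string measure is preferred.
    """
    return SequenceMatcher(None, s.casefold().strip(), t.casefold().strip()).ratio()


def soft_containment(a, b, threshold=0.8):
    """Fraction of distinct values in `a` that have a near match in `b`.

    A value counts as matched if some value in `b` reaches `threshold`
    under string_similarity. This is quadratic in the sample sizes, which
    illustrates why such a measure is slower than exact set overlap.
    """
    a, b = set(a), set(b)
    if not a:
        return 0.0
    matched = sum(
        1 for s in a
        if any(string_similarity(s, t) >= threshold for t in b)
    )
    return matched / len(a)


# Hypothetical samples drawn from two "dirty" city-name columns.
sample_a = ["New York", "Los Angeles", "Chicago"]
sample_b = ["new york", "Los Angelos", "Chicago ", "Boston"]

if __name__ == "__main__":
    print("exact Jaccard:     ", jaccard(sample_a, sample_b))
    print("soft containment:  ", soft_containment(sample_a, sample_b))
```

On these hypothetical samples, no value matches exactly because of case, spelling, and trailing-whitespace differences, so the exact score sees no overlap, while the string-aware score still pairs every value of the first sample with a counterpart in the second. The extra pairwise comparisons are one plausible source of the accuracy versus speed tradeoff the abstract mentions.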