Sampling Dirty Data for Matching Attributes

Abstract

We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when integrating data from different sources. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that for dirty data our measures are more accurate for measuring value overlap than existing sample-based methods, but we also observe that there is a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
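To illustrate the distinction the abstract draws between set-based similarity and similarity between string instances, the following is a minimal sketch. It is not the measure proposed in the paper: the q-gram Jaccard string similarity, the threshold value, and the function names `exact_overlap` and `soft_overlap` are all assumptions chosen for illustration only.

```python
def qgrams(s, q=2):
    """Return the set of character q-grams of a lowercased, padded string."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def string_sim(a, b, q=2):
    """Jaccard similarity of the q-gram sets of two strings (assumed measure)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


def exact_overlap(xs, ys):
    """Purely set-based: fraction of values in xs that appear verbatim in ys."""
    xs, ys = set(xs), set(ys)
    return len(xs & ys) / len(xs) if xs else 0.0


def soft_overlap(xs, ys, threshold=0.5, q=2):
    """Fraction of values in xs with at least one similar-enough value in ys.

    The threshold is a hypothetical tuning parameter, not taken from the paper.
    """
    xs, ys = set(xs), set(ys)
    if not xs:
        return 0.0
    hits = sum(
        1 for x in xs
        if any(string_sim(x, y, q) >= threshold for y in ys)
    )
    return hits / len(xs)


# Two small value samples; the second contains typos ("dirty" values).
clean = ["Brisbane", "Sydney", "Melbourne"]
dirty = ["Brisbane", "Sydny", "Melbourn"]

print(exact_overlap(clean, dirty))
print(soft_overlap(clean, dirty))
```

On this toy input the exact measure counts only the verbatim match, while the soft measure also credits the misspelled values. The soft measure compares every pair of strings, so it is markedly more expensive, which is in line with the accuracy versus speed tradeoff the abstract reports.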

Bibliographic record

Source
ACM SIGMOD International Conference on Management of Data | 2010 | 12 pages
Authors
Henning Köhler, Xiaofang Zhou, Shazia Sadiq
Original format: PDF
Chinese Library Classification: programming; software engineering
Keywords
algorithms; experimentation; measurement; performance

Similar literature

1. Agent-based Household Micro-Datasets: An Estimation Method Composed of Generalized Attributes with Probabilistic Distributions from Sample Data and Available Control Totals by Attribute [J]. Nao SUGIKI, Varameth VICHIENSAN, Noriko OTANI. Asian Transport Studies, 2012, No. 1.

2. A Novel Normalization Forms for Relational Database Design throughout Matching Related Data Attribute [J]. Youseef Alotaibi, Bashar Ramadan. International Journal of Engineering and Manufacturing (IJEM), 2017, No. 5.

3. A Dirty Word or a Dirty World? Attribute Framing, Political Affiliation, and Query Theory [J]. David J. Hardisty, Eric J. Johnson, Elke U. Weber. Psychological science: a journal of the American Psychological Society, 2010, No. 1.

4. Sampling Dirty Data for Matching Attributes [C]. Henning Koehler, Xiaofang Zhou, Shazia Sadiq. ACM SIGMOD international conference on management of data; SIGMOD 2010, 2010.

5. Subgraph Matching on Attributed Multiplex Networks with Applications to Knowledge Graphs [D]. Tu, Thomas K. 2021.

6. Breaking the Deadlock: Simultaneously Discovering Attribute Matching and Cluster Matching with Multi-Objective Metaheuristics [O]. Haishan Liu, Dejing Dou, Hao Wang.

7. Sampling Dirty Data for Matching Attributes [O]. Henning Köhler, Xiaofang Zhou, Shazia Sadiq. 2010.
