Conference: International Conference on Very Large Data Bases

BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution


Abstract

Identifying records that refer to the same entity is a fundamental step for data integration. Since it is prohibitively expensive to compare every pair of records, blocking techniques are typically employed to reduce the complexity of this task. These techniques partition records into blocks and limit the comparison to records co-occurring in a block. Generally, to deal with highly heterogeneous and noisy data (e.g., semi-structured data of the Web), these techniques rely on redundancy to reduce the chance of missing matches. Meta-blocking is the task of restructuring blocks generated by redundancy-based blocking techniques, removing superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features. In this paper, we demonstrate how "loose" schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up your favorite Entity Resolution algorithm. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract this loose information by adopting an LSH-based step for efficiently scaling to large datasets. We experimentally demonstrate, on real-world datasets, how Blast outperforms the state-of-the-art unsupervised meta-blocking approaches, and, in many cases, also the supervised one.
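To make the terms in the abstract concrete, the following is a minimal sketch of redundancy-based blocking followed by a simple meta-blocking step. It is not Blast's method: it uses plain schema-agnostic token blocking, weights each candidate pair by the number of blocks the two records share, and prunes pairs below the mean weight. The function names (`token_blocking`, `candidate_pairs`, `prune`) and the toy records are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Schema-agnostic blocking: one block per token, ignoring attribute names."""
    blocks = defaultdict(set)
    for rid, attrs in records.items():
        for value in attrs.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    # A block holding a single record yields no comparison, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Weight each record pair by the number of blocks it co-occurs in."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune(weights):
    """Meta-blocking step (illustrative): keep pairs at or above the mean weight."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Hypothetical toy data with heterogeneous attribute names.
records = {
    1: {"name": "John Smith", "city": "Boston"},
    2: {"fullname": "Smith John", "town": "Boston"},
    3: {"name": "Mary Jones", "city": "Boston"},
    4: {"title": "Mary Jones"},
}

blocks = token_blocking(records)
weights = candidate_pairs(blocks)
kept = prune(weights)
print(kept)
```

On this toy input, blocking alone produces four candidate pairs; the shared token "boston" is what links records 1 and 2 to record 3. Pruning at the mean weight (1.75) keeps only the pairs (1, 2) and (3, 4), which is the kind of superfluous-comparison removal the abstract attributes to meta-blocking. A loosely schema-aware approach would additionally use statistics about the attributes when forming blocks and weighting pairs, which this sketch does not attempt.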
