Journal: Big Data Research

Boosting the Efficiency of Large-Scale Entity Resolution with Enhanced Meta-Blocking



Abstract

Entity Resolution constitutes a quadratic task that typically scales to large entity collections through blocking. The resulting blocks can be restructured by Meta-blocking to raise precision at a limited cost in recall. At the core of this procedure lies the blocking graph, where the nodes correspond to entities and the edges connect the comparable pairs. There are several configurations for Meta-blocking, but no hints on best practices. In general, the node-centric approaches are more robust and suitable for a series of applications, but suffer from low precision, due to the large number of unnecessary comparisons they retain. In this work, we present three novel methods for node-centric Meta-blocking that significantly improve precision. We also introduce a pre-processing method that restricts the size of the blocking graph by removing a large number of noisy edges. As a result, it reduces the overhead time of Meta-blocking by 2 to 5 times, while increasing precision by up to an order of magnitude for a minor cost in recall. The same technique can be applied as graph-free Meta-blocking, enabling for the first time Entity Resolution over very large datasets even on commodity hardware. We evaluate our approaches through an extensive experimental study over 19 voluminous, established datasets. The outcomes indicate best practices for the configuration of Meta-blocking and verify that our techniques reduce the resolution time of state-of-the-art methods by up to an order of magnitude.
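To make the blocking graph concrete, the following is a minimal sketch and not the paper's method. It assumes that an edge's weight is the number of blocks two entities share, and that a node-centric pruning step keeps an edge when its weight reaches the mean edge weight in the neighbourhood of at least one endpoint. Both choices, the function names, and the toy blocks are illustrative assumptions; the abstract does not specify the three novel methods or the pre-processing step.

```python
from collections import defaultdict
from itertools import combinations


def build_blocking_graph(blocks):
    """Map each comparable pair (a, b) to the number of blocks it shares.

    Assumption: shared-block count is used as the edge weight.
    """
    weights = defaultdict(int)
    for members in blocks.values():
        # Every pair of entities co-occurring in a block is comparable.
        for a, b in combinations(sorted(set(members)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def node_centric_prune(weights):
    """Keep an edge if it meets the mean weight around either endpoint.

    Assumption: a per-node mean threshold stands in for node-centric pruning.
    """
    neighbourhood = defaultdict(list)
    for (a, b), w in weights.items():
        neighbourhood[a].append(w)
        neighbourhood[b].append(w)
    threshold = {n: sum(ws) / len(ws) for n, ws in neighbourhood.items()}
    return {
        (a, b)
        for (a, b), w in weights.items()
        if w >= threshold[a] or w >= threshold[b]
    }


# Hypothetical blocks: blocking key -> entity identifiers.
blocks = {
    "smith": ["e1", "e2", "e3"],
    "john": ["e1", "e2"],
    "paris": ["e2", "e3", "e4"],
    "1980": ["e1", "e2", "e4"],
}

graph = build_blocking_graph(blocks)
retained = node_centric_prune(graph)
print(len(graph), "candidate comparisons before pruning")
print(len(retained), "comparisons retained:", sorted(retained))
```

On this toy input the graph holds six candidate comparisons and pruning keeps three of them, which shows in miniature how restructuring blocks trades a few discarded pairs for fewer comparisons.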
