Journal: IEEE Transactions on Knowledge and Data Engineering

Schema-Agnostic Progressive Entity Resolution



Abstract

Entity Resolution (ER) is the task of finding entity profiles that correspond to the same real-world entity. Progressive ER aims to efficiently resolve large datasets when limited time and/or computational resources are available. In practice, its goal is to provide the best possible partial solution by approximating the optimal comparison order of the entity profiles. So far, Progressive ER has only been examined in the context of structured (relational) data sources, as the existing methods rely on schema knowledge to avoid unnecessary comparisons: they restrict their search space to similar entities with the help of schema-based blocking keys (i.e., signatures that represent the entity profiles). As a result, these solutions are not applicable to Big Data integration applications, which involve large and heterogeneous datasets, such as relational and RDF databases, JSON files, and Web corpora. To cover this gap, we propose a family of schema-agnostic Progressive ER methods, which do not require schema information and thus apply to heterogeneous data sources of any schema variety. First, we introduce two naive schema-agnostic methods, showing that straightforward solutions exhibit poor performance that does not scale well to large volumes of data. Then, we propose four different advanced methods. Through an extensive experimental evaluation over 7 real-world, established datasets, we show that all the advanced methods significantly outperform both the naive and the state-of-the-art schema-based ones. We also investigate the relative performance of the advanced methods, providing guidelines on method selection.
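To make the idea of schema-agnostic blocking keys and a comparison order more concrete, the following is a minimal Python sketch. It is not one of the methods proposed in the paper, which the abstract does not describe in detail. It assumes a simple token-based signature (every token found in any attribute value, ignoring attribute names) and ranks candidate pairs by how many tokens they share, so that a limited comparison budget is spent on the most promising pairs first. The function names `tokens` and `progressive_pairs` and the sample profiles are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import re


def tokens(profile):
    """Schema-agnostic signature: every lowercase alphanumeric token in any value."""
    found = set()
    for value in profile.values():
        found.update(re.findall(r"[a-z0-9]+", str(value).lower()))
    return found


def progressive_pairs(profiles):
    """Yield (id_a, id_b, shared_tokens), most promising pairs first."""
    # Inverted index: token -> ids of the profiles containing it (one "block").
    blocks = defaultdict(set)
    for pid, profile in profiles.items():
        for tok in tokens(profile):
            blocks[tok].add(pid)

    # Weight each candidate pair by the number of blocks it co-occurs in.
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1

    # Descending weight; ties are broken by the ids to keep the order stable.
    for (a, b), w in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        yield a, b, w


# Hypothetical profiles with unrelated attribute names (no shared schema).
profiles = {
    "p1": {"name": "Jane Doe", "city": "Boston"},
    "p2": {"fullName": "doe, jane", "location": "Boston MA"},
    "p3": {"title": "John Smith", "addr": "Austin"},
    "p4": {"label": "J. Smith", "town": "Austin TX"},
}

ranked = list(progressive_pairs(profiles))
budget = 2  # only the first `budget` comparisons are actually executed
for a, b, w in ranked[:budget]:
    print(a, b, w)
```

Because the signature ignores attribute names, the same code runs unchanged on relational rows, JSON objects, or RDF-style property maps. A real system would likely need a more scalable weighting scheme than enumerating all pairs inside every block.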
