33rd International Conference on Very Large Data Bases (VLDB 2007)

Example-driven Design of Efficient Record Matching Queries

Abstract

Record matching is the task of identifying records that match the same real-world entity. This is a problem of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as approximate string matching (e.g., edit distances) and on the use of external reference data sources. Record matching can be viewed as a query composed of a small set of primitive operators. However, formulating such record matching queries is difficult and depends on the specific application scenario. Specifically, the number of options, both in the string matching operations and in the choice of external sources, can be daunting. In this paper, we exploit the availability of positive and negative examples to search through this space and suggest an initial record matching query. Such queries can be subsequently modified by the programmer as needed. We ensure that the record matching queries our approach produces are (1) efficient: these queries can be run on large datasets by leveraging operations that are well-supported by RDBMSs, and (2) explainable: the queries are easy to understand so that they may be modified by the programmer with relative ease. We demonstrate the effectiveness of our approach on several real-world datasets.
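The idea of using labeled examples to choose among string matching options can be illustrated with a small sketch. This is not the algorithm from the paper: it assumes a single attribute, a normalized edit similarity, and a fixed list of candidate thresholds, and it simply picks the threshold that classifies the most example pairs correctly. All function names here (`edit_distance`, `edit_similarity`, `pick_threshold`) are hypothetical and introduced only for illustration.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def pick_threshold(positives: List[Pair],
                   negatives: List[Pair],
                   candidates: List[float]) -> float:
    """Assumed selection rule: return the candidate threshold that labels
    the most example pairs correctly (first such candidate on ties)."""
    def num_correct(t: float) -> int:
        matched = sum(edit_similarity(a, b) >= t for a, b in positives)
        rejected = sum(edit_similarity(a, b) < t for a, b in negatives)
        return matched + rejected
    return max(candidates, key=num_correct)


# Hypothetical labeled examples: matching pairs and non-matching pairs.
positives = [("John Smith", "Jon Smith"), ("Acme Corp", "Acme Corp.")]
negatives = [("John Smith", "Jane Doe"), ("Acme Corp", "Apex Ltd")]
best = pick_threshold(positives, negatives, [0.1, 0.5, 0.95])
```

A threshold chosen this way yields a predicate of the form "similarity is at least t", which stays easy for a programmer to read and adjust, in the spirit of the explainability goal stated in the abstract.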