Conference: International World Wide Web Conference

Web Data Integration Using Approximate String Join


Abstract

Web data integration is an important preprocessing step for web mining. It is highly likely that several records on the web whose textual representations differ may represent the same real world entity. These records are called approximate duplicates. Data integration seeks to identify such approximate duplicates and merge them into integrated records. Many existing data integration algorithms make use of approximate string join, which seeks to (approximately) find all pairs of strings whose distances are less than a certain threshold. In this paper, we propose a new mapping method to detect pairs of strings with similarity above a certain threshold. In our method, each string is first mapped to a point in a high dimensional grid space, then pairs of points whose distances are 1 are identified. We implement it using Oracle SQL and PL/SQL. Finally, we evaluate this method using real data sets. Experimental results suggest that our method is both accurate and efficient.
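To make the notion of approximate string join concrete, the following is a minimal sketch of the naive baseline: compare every pair of strings and keep those whose distance falls below a threshold. It is not the mapping method proposed in the paper, whose grid construction the abstract does not specify. The choice of Levenshtein edit distance, the function names `edit_distance` and `approximate_string_join`, and the sample records are all assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def approximate_string_join(strings, threshold):
    """Return all pairs whose edit distance is strictly below threshold.

    Hypothetical baseline: it compares every pair, so the cost grows
    quadratically with the number of strings.
    """
    return [(s, t) for s, t in combinations(strings, 2)
            if edit_distance(s, t) < threshold]


# Made-up records for illustration; not data from the paper.
records = ["Jon Smith", "John Smith", "Jane Doe"]
pairs = approximate_string_join(records, threshold=2)
```

Because this baseline examines all pairs, it becomes expensive on large record sets, which is the kind of cost that methods like the one described in the abstract aim to reduce.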
