Web Data Integration Using Approximate String Join

Abstract

Web data integration is an important preprocessing step for web mining. Several records on the web whose textual representations differ may well represent the same real-world entity. These records are called approximate duplicates. Data integration seeks to identify such approximate duplicates and merge them into integrated records. Many existing data integration algorithms rely on approximate string join, which seeks to find (approximately) all pairs of strings whose distance is below a given threshold. In this paper, we propose a new mapping method to detect pairs of strings whose similarity is above a given threshold. In our method, each string is first mapped to a point in a high-dimensional grid space; pairs of points at distance 1 are then identified. We implement the method in Oracle SQL and PL/SQL and evaluate it on real data sets. Experimental results suggest that the method is both accurate and efficient.
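
To make the notion of approximate string join concrete, here is a minimal Python sketch of a threshold-based self-join. It is not the mapping method proposed in the paper, whose grid construction the abstract does not describe. The sketch assumes Levenshtein edit distance as the string distance and keeps pairs at or below the threshold. As a loose analogue of mapping strings to points, it prefilters candidates by the L1 gap between character-count vectors. All function names (edit_distance, count_vector_gap, approximate_self_join) and the sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def count_vector_gap(a: str, b: str) -> int:
    """L1 distance between the character-count vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[ch] - cb[ch]) for ch in ca.keys() | cb.keys())


def approximate_self_join(strings, max_distance):
    """Return (s, t, d) for every pair with edit distance d <= max_distance."""
    matches = []
    for s, t in combinations(sorted(set(strings)), 2):
        # A single edit shifts the count vector by at most 2 in L1, so a gap
        # above 2 * max_distance rules the pair out without running the DP.
        if count_vector_gap(s, t) > 2 * max_distance:
            continue
        d = edit_distance(s, t)
        if d <= max_distance:
            matches.append((s, t, d))
    return matches


# Hypothetical records: two spellings each of two names.
records = ["Gregory Madey", "Gregory Madey.", "Yingping Huang", "Yingping Hwang"]
pairs = approximate_self_join(records, max_distance=1)
print(pairs)
# [('Gregory Madey', 'Gregory Madey.', 1), ('Yingping Huang', 'Yingping Hwang', 1)]
```

The nested comparison is quadratic in the number of records, which is the cost that filtering and mapping techniques aim to reduce. The count-vector check here only prunes pairs that cannot match; it never discards a true match.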

Bibliographic details

Source
International World Wide Web Conference | 2004 | 2 pages
Authors
Yingping Huang; Gregory Madey
Original format: PDF
Chinese Library Classification: Computer networks
Keywords
data integration; approximate string join
