
Approximate Joins for Data-Centric XML


Abstract

In data integration applications, a join matches elements that are common to two data sources. Often, however, elements are represented slightly differently in each source, so an approximate join must be used. For XML data, most approximate join strategies are based on some ordered tree matching technique. But in data-centric XML the order is irrelevant: two elements should match even if their subelement order varies. In this paper we give a solution for the approximate join of unordered trees. Our solution is based on windowed pq-grams. We develop an efficient technique to systematically generate windowed pq-grams in a three-step process: sorting the unordered tree, extending the sorted tree with dummy nodes, and computing the windowed pq-grams on the extended tree. The windowed pq-gram distance between two sorted trees approximates the tree edit distance between the respective unordered trees. The approximate join algorithm based on windowed pq-grams is implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Our experiments with synthetic and real-world data confirm the analytic results and suggest that our technique is both useful and scalable.
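The three-step process and the string-based equality join described in the abstract can be illustrated with a rough sketch. This is not the paper's algorithm: the abstract does not define the windowed pq-gram, so the window construction, the dummy label `*`, the parameters `p`, `q`, `w`, the threshold `tau`, the normalized bag distance, and all function names below are assumptions made for illustration only. Trees are assumed to be `(label, [children])` tuples, and `w >= q` is assumed.

```python
from collections import Counter, defaultdict
from itertools import combinations

DUMMY = "*"  # assumed placeholder label for missing ancestors and children


def sort_tree(tree):
    """Step 1: sort the children of every node so sibling order is irrelevant."""
    label, children = tree
    return (label, sorted(sort_tree(c) for c in children))


def profile(tree, p=2, q=2, w=3):
    """Steps 2 and 3 (simplified): pad with dummies, then collect grams as strings."""
    grams = Counter()

    def visit(node, ancestors):
        label, children = node
        stem = (ancestors + (label,))[-p:]          # the node and its p-1 ancestors
        labels = [child[0] for child in children] or [DUMMY]
        labels += [DUMMY] * (w - len(labels))       # extend short child lists
        n = len(labels)
        for i in range(n):                          # circular window of w siblings
            window = [labels[(i + k) % n] for k in range(w)]
            for rest in combinations(window[1:], q - 1):
                grams["|".join(stem + (window[0],) + rest)] += 1
        for child in children:
            visit(child, stem)

    visit(sort_tree(tree), (DUMMY,) * p)
    return grams


def approximate_join(left, right, tau=0.5, **params):
    """Pair trees whose gram bags overlap enough, using only string equality lookups."""
    right_profiles = [profile(t, **params) for t in right]
    index = defaultdict(list)                       # gram string -> [(tree id, count)]
    for j, prof in enumerate(right_profiles):
        for gram, count in prof.items():
            index[gram].append((j, count))

    result = []
    for i, tree in enumerate(left):
        prof = profile(tree, **params)
        overlap = Counter()                         # bag intersection size per candidate
        for gram, count in prof.items():
            for j, other in index.get(gram, ()):
                overlap[j] += min(count, other)
        size = sum(prof.values())
        for j, shared in overlap.items():
            total = size + sum(right_profiles[j].values())
            dist = 1.0 - 2.0 * shared / total       # assumed normalized bag distance
            if dist <= tau:
                result.append((i, j, dist))
    return result


a = ("book", [("title", []), ("author", []), ("year", [])])
b = ("book", [("year", []), ("title", []), ("author", [])])  # same children, reordered
c = ("cd", [("artist", []), ("track", [])])
print(approximate_join([a], [b, c], tau=0.3))  # [(0, 0, 0.0)]
```

Because candidate pairs are found through the gram index, trees that share no gram (such as `a` and `c`) are never compared, which is the point of phrasing the join as an equality join on strings rather than computing a distance for every pair.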
