Journal: ACM Transactions on Database Systems

Integrating XML Data Sources Using Approximate Joins


Abstract

XML is widely recognized as the data interchange standard of tomorrow because of its ability to represent data from a variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this article, we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, an approximate match in structure, in addition to content, has to be folded into the join operation. We quantify an approximate match in structure and content for pairs of XML documents using well-defined notions of distance. We show how notions of distance that have metric properties can be incorporated in a framework for joins between XML data sources and introduce the idea of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set, and we propose sampling-based algorithms to identify them. We then instantiate our join framework using the tree edit distance between a pair of trees. We next turn our attention to utilizing well-known index structures to improve the performance of approximate XML join operations. We present a methodology enabling adaptation of index structures for this problem, and we instantiate it in terms of the R-tree. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets, varying parameters of interest, and highlighting the performance benefits of our approach.
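
To make the role of metric properties concrete, here is a minimal sketch of a threshold join that uses a reference set to skip exact distance computations. It is an illustration under stated assumptions, not the article's algorithm: string edit distance stands in for tree edit distance, the reference set is supplied by hand rather than chosen by sampling, and the names `edit_distance`, `approximate_join`, and `tau` are hypothetical. The only fact it relies on is the triangle inequality, which gives `|d(x, r) - d(y, r)| <= d(x, y)` for any reference element `r`.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (a metric).

    Assumption: a stand-in for the tree edit distance named in the abstract.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def approximate_join(left, right, reference_set, tau, dist=edit_distance):
    """Return (matches, exact_calls) for all pairs with dist(x, y) <= tau.

    Hypothetical helper: each item is projected onto its distances to the
    reference elements, and a pair is pruned when those projections already
    prove the true distance exceeds tau.
    """
    left_vecs = [[dist(x, r) for r in reference_set] for x in left]
    right_vecs = [[dist(y, r) for r in reference_set] for y in right]
    matches, exact_calls = [], 0
    pairs = product(zip(left, left_vecs), zip(right, right_vecs))
    for (x, vx), (y, vy) in pairs:
        # Triangle inequality: this value never exceeds dist(x, y).
        lower_bound = max((abs(a - b) for a, b in zip(vx, vy)), default=0)
        if lower_bound > tau:
            continue  # safe to skip: the pair cannot be within tau
        exact_calls += 1
        if dist(x, y) <= tau:
            matches.append((x, y))
    return matches, exact_calls


if __name__ == "__main__":
    left = ["book", "article"]
    right = ["books", "particle", "proceedings"]
    matches, exact_calls = approximate_join(left, right, ["book"], tau=1)
    print(matches, exact_calls)
```

Because the filter only discards pairs whose lower bound exceeds `tau`, the result is the same as a brute-force join; the reference set affects only how many exact distance computations are avoided.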
