An Efficient and Scalable Algorithm for Clustering XML Documents by Structure

Abstract

With the standardization of XML as an information exchange language over the net, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields our new clustering algorithm which is efficient and effective, compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection.
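
The abstract names the s-graph and its distance metric but does not define them. The following minimal Python sketch illustrates one plausible reading, under stated assumptions: a document's s-graph is taken to be the set of distinct parent-child tag edges in the document, and the distance between two s-graphs is one minus the number of shared edges divided by the size of the larger edge set. Both definitions, and the function names `s_graph`, `s_graph_distance`, and `merge_s_graphs`, are illustrative assumptions, not details confirmed by the abstract.

```python
import xml.etree.ElementTree as ET


def s_graph(xml_text):
    """Return the set of distinct (parent_tag, child_tag) edges in a document.

    Assumption: this edge set stands in for the paper's s-graph.
    Repeated elements collapse into one edge, so the summary stays small.
    """
    root = ET.fromstring(xml_text)
    edges = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            edges.add((node.tag, child.tag))
            stack.append(child)
    return edges


def s_graph_distance(edges_a, edges_b):
    """Assumed metric: 1 - |common edges| / max(|edges_a|, |edges_b|)."""
    larger = max(len(edges_a), len(edges_b))
    if larger == 0:
        return 0.0  # two edgeless documents are treated as identical
    return 1.0 - len(edges_a & edges_b) / larger


def merge_s_graphs(graphs):
    """Summarize a set of documents by the union of their edge sets,
    so the same distance applies between clusters (an assumption)."""
    return set().union(*graphs)


doc_a = "<book><title/><author><name/></author></book>"
doc_b = "<book><title/><title/><author><name/></author></book>"
doc_c = "<article><title/><author><name/></author></article>"

print(s_graph_distance(s_graph(doc_a), s_graph(doc_b)))
print(s_graph_distance(s_graph(doc_a), s_graph(doc_c)))
```

Under these assumptions, `doc_a` and `doc_b` share every edge and are at distance zero despite the repeated `title` element, while `doc_c` shares only the `author`-to-`name` edge with `doc_a`. Set intersection costs time roughly linear in the number of edges, which is the kind of saving over tree-edit distance that the abstract alludes to.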