Enriching XML documents clustering by using concise structure and content

Abstract

With the growing number of XML documents on the Web, it becomes essential to organise these documents effectively in order to retrieve useful information from them. A possible solution is to apply clustering to the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these semi-structured documents due to their heterogeneity and structural irregularity. Because of scalability and complexity problems, most existing research on clustering techniques focuses on only one feature of XML documents: either their structure or their content. The knowledge gained in the form of clusters based on the structure alone or the content alone is not suitable for real-life datasets. It therefore becomes essential to include both the structure and the content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, including both kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods that utilise frequent pattern mining techniques to reduce the dimensionality; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information.
The explicit model uses a higher-order model, namely a 3-order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose large-sized tensor models and uses the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large-scale datasets clearly show the strengths of the proposed methods over state-of-the-art methods. This thesis contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it addresses the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.
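As a rough illustration of the two representations named in the abstract, the sketch below builds a toy document × subtree × term count tensor and then flattens each document's slab into a single vector. This is only one plausible reading of the description, not the thesis's implementation: the subtree identifiers, the terms, the function names and the use of raw counts with cosine similarity are all assumptions made for the example, and the frequent subtree mining and tensor decomposition steps are omitted entirely.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy input: for each document, the terms found inside each
# frequent subtree. Subtree ids and terms are invented for illustration.
docs = {
    "d1": {"article/title": ["xml", "clustering"], "article/body": ["tensor", "xml"]},
    "d2": {"article/title": ["xml", "mining"], "article/body": ["tensor"]},
    "d3": {"book/chapter": ["cooking", "recipes"]},
}


def build_tensor(docs):
    """Return (doc_ids, subtrees, terms, tensor), tensor[d][s][t] = term count."""
    doc_ids = sorted(docs)
    subtrees = sorted({s for d in docs.values() for s in d})
    terms = sorted({t for d in docs.values() for ts in d.values() for t in ts})
    tensor = []
    for d in doc_ids:
        slab = []
        for s in subtrees:
            counts = Counter(docs[d].get(s, []))
            slab.append([counts[t] for t in terms])
        tensor.append(slab)
    return doc_ids, subtrees, terms, tensor


def unfold_documents(tensor):
    """Flatten each document's subtree x term slab into one long vector."""
    return [[v for row in slab for v in row] for slab in tensor]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc_ids, subtrees, terms, tensor = build_tensor(docs)
vectors = unfold_documents(tensor)
sim_d1_d2 = cosine(vectors[0], vectors[1])
sim_d1_d3 = cosine(vectors[0], vectors[2])
```

In this toy setup, `d1` and `d2` share terms inside the same subtrees and so receive a positive similarity, while `d3` shares neither structure nor content with them and scores zero. A term that appears in two documents but under different subtrees would not count as a match, which is the sense in which structure constrains content here.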

Bibliographic details

  • Author

    Kutty Sangeetha;

  • Author affiliation
  • Year 2011
  • Total pages
  • Original format PDF
  • Language English
  • Chinese Library Classification
