Knowledge and Information Systems

On the use of hierarchical information in sequential mining-based XML document similarity computation


Abstract

Measuring the structural similarity among XML documents is the task of finding their semantic correspondence and is fundamental to many web-based applications. While there exist several methods to address the problem, the data mining approach seems to be a novel, interesting and promising one. It explores the idea of extracting paths from XML documents, encoding them as sequences and finding the maximal frequent sequences using the sequential pattern mining algorithms. In view of the deficiencies encountered by ignoring the hierarchical information in encoding the paths for mining, a new sequential pattern mining scheme for XML document similarity computation is proposed in this paper. It makes use of a preorder tree representation (PTR) to encode the XML tree’s paths so that both the semantics of the elements and the hierarchical structure of the document can be taken into account when computing the structural similarity among documents. In addition, it proposes a postprocessing step to reuse the mined patterns to estimate the similarity of unmatched elements so that another metric to qualify the similarity between XML documents can be introduced. Encouraging experimental results were obtained and reported.
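
To make the role of hierarchical information concrete, here is a minimal Python sketch. It is not the paper's PTR encoding or its mining algorithm, neither of which is specified in the abstract. As stated assumptions: each element on a root-to-leaf path is tagged with its depth as a stand-in for the hierarchical encoding, and a brute-force subsequence enumeration replaces a real sequential pattern miner, with "frequent" taken to mean "present in both documents". The function names (`encode_paths`, `subsequences`, `common_maximal_sequences`) are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def encode_paths(xml_text, with_level=True):
    """Return the set of root-to-leaf paths, visited in preorder.

    Each path is a tuple of tokens. With with_level=True a token is
    (tag, depth); this depth tag is an illustrative assumption, not the
    paper's actual encoding.
    """
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix, depth):
        token = (node.tag, depth) if with_level else node.tag
        path = prefix + (token,)
        children = list(node)
        if not children:          # leaf: the path is complete
            paths.add(path)
        for child in children:    # visit children in document order
            walk(child, path, depth + 1)

    walk(root, (), 0)
    return paths


def subsequences(path):
    """All non-empty, order-preserving subsequences of one path.

    Exponential in the path length, so only suitable for short paths.
    """
    return {
        tuple(path[i] for i in idx)
        for n in range(1, len(path) + 1)
        for idx in combinations(range(len(path)), n)
    }


def is_subsequence(short, long):
    """True if `short` occurs in `long` in order (gaps allowed)."""
    it = iter(long)
    return all(token in it for token in short)


def common_maximal_sequences(doc_a, doc_b, with_level=True):
    """Maximal sequences that occur in paths of both documents."""
    subs_a = set().union(*(subsequences(p) for p in encode_paths(doc_a, with_level)))
    subs_b = set().union(*(subsequences(p) for p in encode_paths(doc_b, with_level)))
    common = subs_a & subs_b
    # Keep only sequences that are not contained in a longer common one.
    return {
        s for s in common
        if not any(s != t and is_subsequence(s, t) for t in common)
    }


DOC_A = "<book><chapter><title/></chapter></book>"
DOC_B = "<book><title/></book>"

flat = common_maximal_sequences(DOC_A, DOC_B, with_level=False)
levelled = common_maximal_sequences(DOC_A, DOC_B, with_level=True)
print(flat)      # {('book', 'title')}
print(levelled)  # {(('book', 0),)}
```

Without depth tags, the two documents appear to share the sequence `book, title`, even though `title` is a grandchild of `book` in one document and a direct child in the other. With depth tags, only the root matches. This is the kind of false match that can arise when hierarchy is ignored during path encoding.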
