Integrating element and term semantics for similarity-based XML document clustering

Abstract

The structured link vector model (SLVM) is a recently proposed document representation that takes both structural and semantic information into account when measuring XML document similarity. Its formulation includes an element similarity matrix that captures the semantic similarity between XML elements, the structural components of XML documents. In this paper, instead of applying heuristics to define the similarity matrix, we propose to learn the matrix iteratively from pairwise-similar training data. In addition, we extend SLVM to SLVM-LSI by incorporating term semantics into SLVM using latent semantic indexing, while preserving the element-similarity properties of the original SLVM. For performance evaluation, we applied SLVM-LSI to similarity-based clustering of two XML datasets, and the proposed SLVM-LSI was found to significantly outperform the conventional vector space model and edit-distance-based methods. The similarity matrix, obtained as a byproduct of the learning, can provide higher-level knowledge about the semantic relationships between the XML elements.
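The abstract does not give the formulas, so the following is only a minimal sketch of how an element similarity matrix and an LSI projection could be combined in one similarity measure. It assumes each document is a terms-by-elements feature matrix and that similarity is the sum of column inner products weighted by the element similarity matrix. The names `slvm_similarity`, `lsi_basis`, and `slvm_lsi_similarity` are hypothetical, and the iterative learning of the matrix described in the abstract is not shown.

```python
import numpy as np

def slvm_similarity(dx, dy, M):
    """Assumed SLVM-style similarity (illustrative, not the paper's exact formula).

    dx, dy: (n_terms, n_elements) feature matrices of two documents.
    M:      (n_elements, n_elements) element similarity matrix.
    Returns sum over i, j of M[i, j] * <dx[:, i], dy[:, j]>.
    """
    return float(np.sum(M * (dx.T @ dy)))

def lsi_basis(docs, k):
    """Top-k left singular vectors of a term-by-document matrix.

    Assumption: each document's element columns are summed into a single
    term vector before the SVD.
    """
    A = np.stack([d.sum(axis=1) for d in docs], axis=1)  # (n_terms, n_docs)
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :k]

def slvm_lsi_similarity(dx, dy, M, U_k):
    """Project the term dimension into the latent space, then reuse M."""
    return slvm_similarity(U_k.T @ dx, U_k.T @ dy, M)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    docs = [rng.random((5, 2)) for _ in range(6)]   # 5 terms, 2 elements
    M = np.array([[1.0, 0.3], [0.3, 1.0]])          # hand-picked, not learned
    U_k = lsi_basis(docs, k=2)
    print(slvm_lsi_similarity(docs[0], docs[1], M, U_k))
```

With `M` set to the identity, the measure reduces to an element-by-element dot product. Off-diagonal entries let terms under different but related elements contribute to the score.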
