Utilizing the Structure and Content Information for XML Document Clustering

Abstract

This paper reports the experiments and results of a clustering approach used in the INEX 2008 document mining challenge. The approach uses both the structure and the content of the Wikipedia XML document collection. A latent semantic kernel (LSK) measures the semantic similarity between XML documents based on their content features. Constructing a latent semantic kernel requires computing a singular value decomposition (SVD). On a large feature-space matrix, computing the SVD is very expensive in both time and memory. In this approach, therefore, the dimension of the document space of the term-document matrix is reduced before the SVD is performed. The document space reduction is based on the common structural information of the Wikipedia XML document collection. The proposed clustering approach was shown to be effective on the Wikipedia collection in the INEX 2008 document mining challenge.
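The idea in the abstract can be illustrated with a minimal sketch, assuming NumPy and a dense toy matrix. The function names `build_lsk` and `lsk_similarity`, the parameter `doc_subset`, and the use of cosine similarity in the projected space are assumptions made for illustration; the abstract does not state how the paper selects documents from the structural information, so the subset is simply passed in as column indices here.

```python
import numpy as np


def build_lsk(term_doc, k, doc_subset=None):
    """Return a (terms x k) projection matrix from the top-k left singular vectors.

    term_doc   : array of shape (n_terms, n_docs), e.g. term frequencies.
    k          : number of latent dimensions to keep.
    doc_subset : optional column indices; if given, the SVD is computed on
                 this reduced document space only (hypothetical stand-in for
                 a structure-based document selection).
    """
    A = np.asarray(term_doc, dtype=float)
    if doc_subset is not None:
        A = A[:, doc_subset]  # fewer columns, so a cheaper SVD
    # Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k]


def lsk_similarity(P, d1, d2):
    """Cosine similarity of two term vectors after projection by P (assumed form)."""
    p1 = P.T @ np.asarray(d1, dtype=float)
    p2 = P.T @ np.asarray(d2, dtype=float)
    denom = np.linalg.norm(p1) * np.linalg.norm(p2)
    return float(p1 @ p2 / denom) if denom > 0 else 0.0


# Toy term-document matrix: 4 terms (rows) x 4 documents (columns).
# Documents 0 and 1 share vocabulary; documents 2 and 3 share a different one.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
])

P_full = build_lsk(A, k=2)                         # kernel from all documents
P_reduced = build_lsk(A, k=2, doc_subset=[0, 2])   # kernel from a document subset

sim_related = lsk_similarity(P_reduced, A[:, 0], A[:, 1])
sim_unrelated = lsk_similarity(P_reduced, A[:, 0], A[:, 2])
```

In this toy matrix the kernel built from only two of the four documents still scores the pair with shared vocabulary near 1 and the unrelated pair near 0, which is the kind of saving the document space reduction aims for: the SVD runs on a smaller matrix, while the resulting projection is applied to every document.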

Bibliographic details

  • Source
    Advances in Focused Retrieval | 2008 | pp. 460-468 | 9 pages
  • Conference location: Dagstuhl Castle (DE)
  • Author affiliations (the same for all three listed authors)

    Faculty of Science and Technology, Queensland University of Technology, GPO Box 2434, Brisbane Qld 4001, Australia

  • Original format: PDF
  • Language: English
  • Chinese Library Classification: Information processing
  • Keywords

    Wikipedia; clustering; LSK; INEX 2008
