Conference: International Symposium on Methodologies for Intelligent Systems

Semantically-Guided Clustering of Text Documents via Frequent Subgraphs Discovery

Abstract

In this paper we introduce and analyze two improvements to GDClust [1], a system for document clustering based on the co-occurrence of frequent subgraphs. GDClust (Graph-Based Document Clustering) works with frequent senses derived from the constraints of natural language, rather than with the co-occurrences of frequent keywords commonly used in the vector space model (VSM) of document clustering. Text documents are transformed into hierarchical document-graphs, and an efficient graph-mining technique is used to find frequent subgraphs. The discovered frequent subgraphs are then used to generate accurate sense-based document clusters. We introduce two novel mechanisms, the Subgraph-Extension Generator (SEG) and the Maximum Subgraph-Extension Generator (MaxSEG), which directly use constraints from natural language to reduce the number of candidates and the overhead imposed by our first implementation of GDClust.
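
The pipeline the abstract describes (documents to hierarchical document-graphs, frequent subgraph discovery, clustering on shared subgraphs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the GDClust implementation: the `HYPERNYM` taxonomy is a hypothetical toy hierarchy, the edge-by-edge extension is a generic level-wise miner rather than SEG or MaxSEG, and the threshold-based grouping is a simple stand-in for whatever clustering step the authors use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy taxonomy (child -> parent); a real system would use a lexical ontology.
HYPERNYM = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
    "car": "vehicle", "truck": "vehicle", "vehicle": "artifact",
}

def document_graph(words):
    """Build a document-graph: the set of (child, parent) edges on each word's path to the root."""
    edges = set()
    for node in words:
        while node in HYPERNYM:
            edges.add((node, HYPERNYM[node]))
            node = HYPERNYM[node]
    return frozenset(edges)

def support(subgraph, graphs):
    """Count the document-graphs that contain every edge of the subgraph."""
    return sum(1 for g in graphs if subgraph <= g)

def extend(subgraph, freq_edges):
    """Grow a subgraph by one frequent edge that shares a node with it (keeps it connected)."""
    nodes = {n for edge in subgraph for n in edge}
    return {subgraph | {e} for e in freq_edges
            if e not in subgraph and (e[0] in nodes or e[1] in nodes)}

def mine_frequent_subgraphs(graphs, min_support):
    """Level-wise mining: start from frequent edges, extend, keep candidates that stay frequent."""
    counts = Counter(e for g in graphs for e in g)
    freq_edges = {e for e, c in counts.items() if c >= min_support}
    level = {frozenset([e]) for e in freq_edges}
    found = set(level)
    while level:
        candidates = set().union(*(extend(s, freq_edges) for s in level))
        level = {c for c in candidates if support(c, graphs) >= min_support}
        found |= level
    return found

def cluster(graphs, subgraphs, threshold=0.5):
    """Merge documents whose sets of contained frequent subgraphs have Jaccard similarity >= threshold."""
    feats = [{s for s in subgraphs if s <= g} for g in graphs]
    labels = list(range(len(graphs)))
    for i, j in combinations(range(len(graphs)), 2):
        union = feats[i] | feats[j]
        sim = len(feats[i] & feats[j]) / len(union) if union else 0.0
        if sim >= threshold:
            old, new = labels[j], labels[i]
            labels = [new if label == old else label for label in labels]
    return labels

docs = [["dog", "wolf"], ["dog", "cat"], ["car", "truck"], ["truck"]]
graphs = [document_graph(d) for d in docs]
subgraphs = mine_frequent_subgraphs(graphs, min_support=2)
print(len(subgraphs), cluster(graphs, subgraphs))
```

In this toy run the two animal documents share the frequent path from "dog" up to "animal" even though they have only one keyword in common, which is the kind of sense-level overlap a keyword-based VSM would weight less. The candidate set in `extend` grows quickly on larger graphs, which is the overhead the abstract says SEG and MaxSEG are designed to reduce.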