Conference: Foundations of intelligent systems

Semantically-Guided Clustering of Text Documents via Frequent Subgraphs Discovery

Abstract

In this paper we introduce and analyze two improvements to GDClust [1], a system for document clustering based on the co-occurrence of frequent subgraphs. GDClust (Graph-Based Document Clustering) works with frequent senses derived from the constraints provided by the natural language rather than working with the co-occurrences of frequent keywords commonly used in the vector space model (VSM) of document clustering. Text documents are transformed to hierarchical document-graphs, and an efficient graph-mining technique is used to find frequent subgraphs. Discovered frequent subgraphs are then utilized to generate accurate sense-based document clusters. In this paper, we introduce two novel mechanisms called the Subgraph-Extension Generator (SEG) and the Maximum Subgraph-Extension Generator (MaxSEG) which directly utilize constraints from the natural language to reduce the number of candidates and the overhead imposed by our first implementation of GDClust.
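
The pipeline the abstract describes (documents to hierarchical document-graphs, then frequent subgraphs, then sense-based clusters) can be illustrated with a minimal Python sketch. It is not the GDClust, SEG, or MaxSEG implementation, whose details the abstract does not give. The toy HYPERNYM hierarchy, the function names, the one-connected-edge-at-a-time candidate growth, and the Jaccard similarity are all assumptions chosen for illustration.

```python
# Hypothetical toy hierarchy (child -> parent); a stand-in for a real
# lexical ontology, which the abstract does not specify.
HYPERNYM = {
    "dog": "canine", "wolf": "canine", "canine": "animal",
    "cat": "feline", "feline": "animal",
    "car": "vehicle", "truck": "vehicle", "vehicle": "artifact",
}

def document_graph(words):
    """Build a document-graph: all (child, parent) edges from each word up to its root."""
    edges = set()
    for word in words:
        while word in HYPERNYM:
            edges.add((word, HYPERNYM[word]))
            word = HYPERNYM[word]
    return frozenset(edges)

def frequent_subgraphs(graphs, min_support):
    """Level-wise mining: grow frequent subgraphs by one connected frequent edge at a time."""
    def support(sub):
        # Number of document-graphs that contain every edge of `sub`.
        return sum(1 for g in graphs if sub <= g)

    all_edges = set().union(*graphs)
    level = {frozenset([e]) for e in all_edges
             if support(frozenset([e])) >= min_support}
    frequent_edges = {next(iter(s)) for s in level}
    found = set(level)
    while level:
        next_level = set()
        for sub in level:
            nodes = {n for edge in sub for n in edge}
            for e in frequent_edges:
                # Only extend with an edge that touches the current subgraph
                # (an assumed simplification that keeps candidates connected).
                if e not in sub and (e[0] in nodes or e[1] in nodes):
                    cand = sub | {e}
                    if cand not in found and support(cand) >= min_support:
                        next_level.add(cand)
        found |= next_level
        level = next_level
    return found

def similarity(g1, g2, subgraphs):
    """Jaccard overlap of the frequent subgraphs contained in two document-graphs."""
    s1 = {s for s in subgraphs if s <= g1}
    s2 = {s for s in subgraphs if s <= g2}
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

docs = [["dog", "wolf"], ["dog", "cat"], ["car", "truck"], ["truck"]]
graphs = [document_graph(d) for d in docs]
subs = frequent_subgraphs(graphs, min_support=2)
for i in range(len(graphs)):
    print([round(similarity(graphs[i], g, subs), 2) for g in graphs])
```

The resulting pairwise similarity matrix could be passed to any standard clustering routine. Restricting each extension to edges that already touch the subgraph is one simple way to limit the number of candidates. Whether SEG or MaxSEG works this way is not stated in the abstract.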
