Journal: Knowledge and Information Systems

Knowledge-based vector space model for text clustering

Abstract

This paper presents a new knowledge-based vector space model (VSM) for text clustering. In the new model, semantic relationships between terms (e.g. words or concepts) are included in representing text documents as a set of vectors. The idea is to calculate the dissimilarity between two documents more effectively so that text clustering results can be enhanced. In this paper, the semantic relationship between two terms is defined by the similarity of the two terms. Such similarity is used to re-weight term frequency in the VSM. We consider and study two different similarity measures for computing the semantic relationship between two terms based on two different approaches. The first approach is based on the existing ontologies like WordNet and MeSH. We define a new similarity measure that combines the edge-counting technique, the average distance and the position weighting method to compute the similarity of two terms from an ontology hierarchy. The second approach is to make use of text corpora to construct the relationships between terms and then calculate their semantic similarities. Three clustering algorithms, bisecting k-means, feature weighting k-means and a hierarchical clustering algorithm, have been used to cluster real-world text data represented in the new knowledge-based VSM. The experimental results show that the clustering performance based on the new model was much better than that based on the traditional term-based VSM.
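The re-weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formula, so this assumes one simple form: each term's new weight is the similarity-weighted sum of all term frequencies in the document. The vocabulary, the similarity values in `SIM`, and the helper names `reweight` and `cosine_dissimilarity` are hypothetical and chosen only for illustration. They are not the authors' measure or implementation.

```python
import math


def reweight(tf, sim):
    """Re-weight term frequencies with a term-term similarity matrix.

    Assumed form (not taken from the paper): new[i] = sum_j sim[i][j] * tf[j].
    """
    n = len(tf)
    return [sum(sim[i][j] * tf[j] for j in range(n)) for i in range(n)]


def cosine_dissimilarity(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical vocabulary: ["car", "automobile", "banana"].
# Similarity values are made up; in the paper they would come from an
# ontology (e.g. WordNet, MeSH) or from a text corpus.
SIM = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_a = [2, 0, 0]  # mentions only "car"
doc_b = [0, 3, 0]  # mentions only "automobile"

# Term-based VSM: the documents share no terms, so they look unrelated.
plain = cosine_dissimilarity(doc_a, doc_b)

# Similarity-re-weighted VSM: related terms pull the documents together.
weighted = cosine_dissimilarity(reweight(doc_a, SIM), reweight(doc_b, SIM))

print(f"term-based dissimilarity:  {plain:.4f}")
print(f"re-weighted dissimilarity: {weighted:.4f}")
```

In this toy case the two documents share no terms, so the term-based representation treats them as maximally dissimilar, while the re-weighted representation places them close together. That is the effect the abstract describes as calculating document dissimilarity more effectively.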
