Conference: International Conference on Broadband Communications, Informatics and Biomedical Applications

WordNet-Based and N-Grams-Based Document Clustering: A Comparative Study

Abstract

Many unsupervised classification (clustering) methods have been applied to text documents. In this paper, we first apply Kohonen self-organizing maps to cluster text documents represented by n-grams. We then study the same method with WordNet synsets as the terms that represent the documents. The effects of both representations are examined in several experiments using four similarity measures: the cosine distance, the Euclidean distance, the squared Euclidean distance, and the Manhattan distance. The Reuters-21578 corpus is used for evaluation, which is carried out with the F-measure and entropy. The results show that, although the n-gram method performs well, adding lexical knowledge to the representation produces a better classification.
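The four measures named in the abstract can be illustrated on character n-gram count vectors. The following is a minimal sketch, not the authors' implementation: the function names, the choice of character trigrams, and the whitespace/lowercase normalization are all assumptions made here for illustration, since the abstract does not specify them.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (assumed: lowercased, single-spaced)."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def to_vectors(a, b):
    """Align two n-gram count profiles on their shared, sorted vocabulary."""
    keys = sorted(set(a) | set(b))
    return [a.get(k, 0) for k in keys], [b.get(k, 0) for k in keys]


def cosine_distance(u, v):
    """1 minus cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def squared_euclidean(u, v):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def euclidean(u, v):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(u, v))


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))


if __name__ == "__main__":
    # Two made-up snippets, used only to exercise the functions.
    doc_a = char_ngrams("oil prices rise")
    doc_b = char_ngrams("oil price rises")
    u, v = to_vectors(doc_a, doc_b)
    for name, fn in [("cosine", cosine_distance), ("euclidean", euclidean),
                     ("squared euclidean", squared_euclidean),
                     ("manhattan", manhattan)]:
        print(name, fn(u, v))
```

In a self-organizing map, a measure of this kind would be used to find the map unit closest to each document vector. A synset-based representation would replace `char_ngrams` with counts of WordNet synsets and leave the distance functions unchanged.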
