Text document clustering based on frequent word meaning sequences

Abstract

Most existing text clustering algorithms use the vector space model, which treats documents as bags of words. Word sequences in the documents are therefore ignored, even though the meaning of natural language depends strongly on them. In this paper, we propose two new text clustering algorithms, named Clustering based on Frequent Word Sequences (CFWS) and Clustering based on Frequent Word Meaning Sequences (CFWMS). A word is the word form appearing in the document, and a word meaning is the concept expressed by synonymous word forms. A word (meaning) sequence is frequent if it occurs in more than a certain percentage of the documents in the text database. The frequent word (meaning) sequences can provide compact and valuable information about those text documents. For experiments, we used the Reuters-21578 text collection, the CISI documents of the Classic data set, and a corpus of the Text Retrieval Conference (TREC) [High Accuracy Retrieval from Documents (HARD) Track of Text Retrieval Conference, 2004]. Our experimental results show that CFWS and CFWMS have much better clustering accuracy than Bisecting k-means (BKM) [M. Steinbach, G. Karypis, V. Kumar, A Comparison of Document Clustering Techniques, KDD-2000 Workshop on Text Mining, 2000], a modified bisecting k-means using background knowledge (BBK) [A. Hotho, S. Staab, G. Stumme, Ontologies improve text document clustering, in: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003, pp. 541-544] and Frequent Itemset-based Hierarchical Clustering (FIHC) [B.C.M. Fung, K. Wang, M. Ester, Hierarchical document clustering using frequent itemsets, in: Proceedings of SIAM International Conference on Data Mining, 2003] algorithms.
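The definition of a frequent word (meaning) sequence can be illustrated with a minimal Python sketch. It is not the CFWS or CFWMS algorithm from the paper: it assumes that sequences are contiguous runs of a fixed length, that tokenization is whitespace splitting, and that word meanings come from a small hand-written synonym table. The names `frequent_sequences`, `min_support`, and `concept_of` are hypothetical and chosen only for this example.

```python
from collections import Counter


def frequent_sequences(docs, length=2, min_support=0.5, concept_of=None):
    """Return {sequence: document count} for sequences of `length` tokens
    that occur in more than `min_support` (a fraction) of the documents.

    `concept_of` is an optional, hypothetical mapping from a word form to a
    concept label; when given, synonymous word forms are merged so that the
    result contains word *meaning* sequences instead of word sequences.
    """
    concept_of = concept_of or {}
    support = Counter()
    for doc in docs:
        # Simplifying assumption: lowercase and split on whitespace.
        tokens = [concept_of.get(w, w) for w in doc.lower().split()]
        # A set ensures each sequence is counted at most once per document.
        seen = {tuple(tokens[i:i + length])
                for i in range(len(tokens) - length + 1)}
        support.update(seen)
    threshold = min_support * len(docs)
    # "More than a certain percentage" is read as a strict inequality.
    return {seq: n for seq, n in support.items() if n > threshold}


docs = [
    "the stock market fell",
    "the share market fell sharply",
    "rain fell on the market",
]

# Word sequences: "stock" and "share" are distinct word forms.
word_result = frequent_sequences(docs, length=2, min_support=0.5)

# Word meaning sequences: both forms map to one assumed concept label.
synonyms = {"stock": "EQUITY", "share": "EQUITY"}
meaning_result = frequent_sequences(docs, length=2, min_support=0.5,
                                    concept_of=synonyms)
```

With these three toy documents, only `("market", "fell")` is frequent at the word level, while merging the two synonymous forms also makes `("the", "EQUITY")` and `("EQUITY", "market")` frequent. This shows how word meanings can expose shared sequences that differing word forms hide.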