
Research on Data Cleaning in Text Clustering



Abstract

Data cleaning in current text clustering preprocessing can mistakenly remove words that have discriminative power. This paper proposes a more reasonable data cleaning method that takes the appearance of new domain-specific words into account. For rare-word filtering, the method considers both the importance of a word in the whole text collection (its word frequency) and its importance in the text in which it appears (its weight). This avoids forcing such words into existing categories, so that filtering is comparatively accurate and the clustering result becomes more precise. Finally, text clustering is performed with the C-means algorithm, and the experiment verifies that the method improves the accuracy of the clustering result.
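The two-criterion rare-word filter described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's actual procedure: the abstract does not specify the weighting scheme or the thresholds, so TF-IDF is used here as a stand-in for the in-text weight, and the names `filter_rare_words`, `min_count`, and `min_weight` are hypothetical.

```python
import math
from collections import Counter


def filter_rare_words(docs, min_count=3, min_weight=0.3):
    """Return the set of words kept after rare-word filtering.

    docs: list of token lists (tokenization is assumed to be done already).
    A word is dropped only if it is rare in the whole collection
    (total count below min_count) AND its highest in-document weight
    is below min_weight. TF-IDF is an assumed choice of weight.
    """
    n_docs = len(docs)
    total = Counter()     # collection-wide word frequency
    doc_freq = Counter()  # number of documents containing each word
    for tokens in docs:
        total.update(tokens)
        doc_freq.update(set(tokens))

    best_weight = Counter()  # highest TF-IDF weight of each word over all documents
    for tokens in docs:
        if not tokens:
            continue
        for word, count in Counter(tokens).items():
            weight = (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            best_weight[word] = max(best_weight[word], weight)

    return {
        word
        for word in total
        if total[word] >= min_count or best_weight[word] >= min_weight
    }


# Toy collection: "qubit" is rare overall but dominant in its own document,
# while "gate" and "teh" are rare and carry little weight where they appear.
docs = [
    ["data", "qubit", "qubit", "gate"],
    ["data", "cluster", "text", "cluster", "text",
     "word", "word", "teh", "cluster", "text"],
    ["data", "cluster", "text", "word"],
]
kept = filter_rare_words(docs)
print(sorted(kept))
```

In this toy run, a filter based on collection frequency alone would discard "qubit" together with "gate" and "teh". Adding the in-document weight keeps "qubit", which is the kind of new domain word the abstract aims to preserve.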
