Conference: International Conference on Fuzzy Systems and Knowledge Discovery

A Feature Selection Method for Document Clustering Based on Part-of-Speech and Word Co-Occurrence


Abstract

Feature selection is a process that chooses a subset of the original feature set according to some rules. The selected features retain their original physical meaning and provide a better understanding of the data and the learning process. However, few modern feature selection approaches take advantage of the features' context information. Based on this analysis, we propose a novel feature selection method based on part-of-speech and word co-occurrence. According to the components of Chinese document text, we use the words' part-of-speech attributes to filter out many meaningless terms. Then we define co-occurrence words by their part-of-speech and use them to select features. For evaluation, we run experiments on the text corpus from Sogou Lab and use Entropy and Precision as criteria to give an objective evaluation of document clustering performance. The results show that our method selects better features and achieves better clustering performance.
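The abstract names Entropy and Precision as its clustering criteria but does not give their formulas. The sketch below assumes the definitions commonly used in document clustering: per-cluster entropy over the true class distribution, per-cluster precision as the share of the majority class, and both averaged with cluster-size weights. The function name `cluster_entropy_and_precision` and its arguments are hypothetical and not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy_and_precision(labels, clusters):
    """Return (entropy, precision), each weighted by cluster size.

    labels:   true class of each document (assumed to be known)
    clusters: cluster id assigned to each document
    Assumed definitions, not confirmed by the abstract:
      entropy of a cluster   = -sum_i p_i * log2(p_i)
      precision of a cluster = max_i p_i
    where p_i is the fraction of the cluster belonging to class i.
    """
    if len(labels) != len(clusters):
        raise ValueError("labels and clusters must have the same length")

    n = len(labels)
    members = defaultdict(list)
    for label, cluster in zip(labels, clusters):
        members[cluster].append(label)

    entropy = 0.0
    precision = 0.0
    for docs in members.values():
        size = len(docs)
        counts = Counter(docs)
        cluster_entropy = -sum(
            (c / size) * math.log2(c / size) for c in counts.values()
        )
        cluster_precision = max(counts.values()) / size
        # Weight each cluster by its share of all documents.
        entropy += (size / n) * cluster_entropy
        precision += (size / n) * cluster_precision
    return entropy, precision
```

Under these assumed definitions, lower entropy and higher precision both indicate clusters that align more closely with the true classes, so a feature selection method can be compared against a baseline by clustering with each feature set and comparing the two scores.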
