Published in: 中文信息学报 (Journal of Chinese Information Processing)

An Improved Chinese Text Classification Algorithm Based on Multiple Feature Factors


Abstract

Web page text is represented with the vector space model (VSM). Four feature factors (frequency, concentration, dispersion, and position) are introduced into the CHI (chi-square) feature selection algorithm, and the TF-IDF weighting formula is improved by taking term length and position factors into account. The resulting PCHI-PTFIDF (promoted CHI-promoted TF-IDF) algorithm is proposed for Chinese text classification. The improved algorithm reduces dimensionality to a feature set with stronger classification ability and reflects the weight distribution of feature terms more accurately. Experimental results show that, compared with a text classification algorithm using traditional CHI and traditional TF-IDF, PCHI-PTFIDF improves the macro-F1 score by 10% on average.
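For orientation, the baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of the textbook CHI statistic and the textbook TF-IDF weight only; it is not the PCHI-PTFIDF method, whose factor definitions are not given in the abstract. The function names `chi_square` and `tf_idf`, the toy corpus, and the choice of `tf = count / document length` with `idf = log(N / df)` are assumptions made for this example. Chinese text would first need word segmentation, so the documents here are assumed to be already tokenized.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, category):
    """Textbook CHI statistic of `term` with respect to `category`.

    docs: list of token lists; labels: parallel list of category names.
    (Hypothetical helper for illustration, not the paper's PCHI.)
    """
    n = len(docs)
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document in another category
        elif in_cat:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document in another category
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def tf_idf(docs):
    """Textbook TF-IDF per document: (count / doc length) * log(N / df).

    (Hypothetical helper for illustration, not the paper's PTFIDF.)
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in docs for t in set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (cnt / total) * math.log(n / df[t]) for t, cnt in counts.items()}
        )
    return weights


# Toy, pre-tokenized corpus (assumed data for the example).
docs = [
    ["stock", "rise"],
    ["stock", "fund"],
    ["football", "match"],
    ["football", "goal"],
]
labels = ["finance", "finance", "sports", "sports"]

chi_stock = chi_square(docs, labels, "stock", "finance")
chi_rise = chi_square(docs, labels, "rise", "finance")
weights = tf_idf(docs)
```

In this toy corpus "stock" appears in every finance document and in no sports document, so it receives a higher CHI score than "rise", which appears in only one finance document. Within the first document, TF-IDF ranks "rise" above "stock" because "rise" occurs in fewer documents overall.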
