Source: 《测控技术》 (Chinese-language journal)

A Text Classification Method for Chinese Sensitive Web Page Recognition

Abstract

A text classification method for recognizing sensitive (pornographic) Chinese web pages is proposed. It has four components: Chinese word segmentation, stop-word list construction, feature selection, and the classifier. To enrich the dictionary used for Chinese word segmentation, a new-word identification algorithm is proposed that relies mainly on word-frequency statistics, supplemented by manual judgment and part-of-speech tagging. A stop-word list construction algorithm is also proposed and used to build a list of 300 stop words. The chi-square (CHI) goodness-of-fit statistic is adopted as the feature selection method, and a 400-dimensional feature lexicon is determined. Based on the characteristics of chi-square feature selection and of the naive Bayes classifier, the classifier is improved by adding two influence factors: the ratio of the number of feature terms found in the page text to the dimension of the feature set, and the ratio of the number of feature terms to the number of distinct words in the text. Because different groups of people differ widely in their subjective understanding of what counts as sensitive, the classifier outputs the sensitivity value of the page being recognized. Experimental results show that the proposed method achieves better recognition performance than existing text classification methods.
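
The chi-square feature selection step and the two ratios named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the two-class (sensitive vs. normal) setup, document-frequency counting, and the choice to count distinct feature terms and distinct words are all assumptions made for illustration. The abstract does not say how the two ratios enter the naive Bayes classifier, so the sketch only computes them.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic of one term for a two-class 2x2 table.

    a: sensitive documents containing the term
    b: normal documents containing the term
    c: sensitive documents without the term
    d: normal documents without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score.

    docs: list of token lists (already segmented, stop words removed).
    labels: list of booleans, True for a sensitive document.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, is_pos in zip(docs, labels):
        # Document frequency: count each term once per document.
        (pos_df if is_pos else neg_df).update(set(tokens))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def influence_ratios(tokens, features):
    """Compute the two ratios described in the abstract (assumed counting).

    Returns (distinct feature terms in the text / feature-set size,
             distinct feature terms in the text / distinct words in the text).
    """
    feature_set = set(features)
    words = set(tokens)
    present = words & feature_set
    r1 = len(present) / len(feature_set) if feature_set else 0.0
    r2 = len(present) / len(words) if words else 0.0
    return r1, r2
```

In the paper's setting, `k` would be 400 and the token lists would come from the segmenter after removing the 300 stop words. How the two ratios are then combined with the naive Bayes score to produce the sensitivity value is left open here because the abstract does not specify it.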
