A Text Classification Method for Chinese Sensitive Web Page Recognition

陈欣; 张菁; 李晓光; 卓力


Abstract

This paper proposes a text classification method for recognizing sensitive (pornographic) Chinese web pages. The method has four components: Chinese word segmentation, stop-word list construction, feature selection, and classification. To enrich the segmentation dictionary, a new-word identification algorithm is proposed that relies mainly on word-frequency statistics, is supplemented by manual judgment, and tags each new word with its part of speech. A stop-word list construction algorithm is also proposed and used to build a list of 300 stop words. The chi-square (CHI) statistic is used for feature selection, yielding a 400-dimensional feature lexicon. Based on the characteristics of chi-square feature selection and of the naive Bayes classifier, the classifier is improved by adding two influencing factors: the ratio of the number of feature terms contained in the web page text to the dimension of the feature set, and the ratio of the number of feature terms contained in the text to the number of distinct words in the text. Because people's subjective understanding of what counts as sensitive varies widely, the sensitivity value of the web page is used as the classifier output. Experimental results show that the proposed method achieves better recognition performance than existing text classification methods.
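The feature-selection step and the two influencing factors described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`chi_square`, `select_features`, `influence_ratios`) are hypothetical, the chi-square score uses the standard 2x2 term/class contingency form, and the second ratio is assumed to be taken over distinct words. The abstract does not state how the two ratios enter the naive Bayes decision, so the sketch only computes them.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square score of a term from a 2x2 document-count table.

    a: sensitive docs containing the term, b: normal docs containing it,
    c: sensitive docs without it,         d: normal docs without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score.

    docs: list of token lists (already segmented, stop words removed).
    labels: list of bools, True for a sensitive document.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, is_pos in zip(docs, labels):
        # Count document frequency, so each term counts once per document.
        (pos_df if is_pos else neg_df).update(set(tokens))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    # Break ties alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def influence_ratios(tokens, features):
    """Compute the two ratios named in the abstract (assumed definitions).

    Returns (features present / feature-set dimension,
             features present / distinct words in the text).
    """
    unique = set(tokens)
    hits = unique & set(features)
    coverage = len(hits) / len(features) if features else 0.0
    density = len(hits) / len(unique) if unique else 0.0
    return coverage, density
```

In the paper's setting, `k` would be 400 and `docs` would be segmented web page texts. A page that contains only a handful of the selected features gets a low first ratio, which is presumably why the authors treat it as a correction to the raw naive Bayes score.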

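Bibliographic information

Source: 《测控技术》 (Measurement & Control Technology), 2011, No. 5, pp. 27-31, 40 (6 pages)
Authors: 陈欣; 张菁; 李晓光; 卓力
Affiliation (all four authors): Signal and Information Processing Laboratory, Beijing University of Technology, Beijing 100124
Original format: PDF
Language of text: Chinese
CLC classification: Information processing
Keywords: Chinese sensitive web page recognition; new word identification; stop-word list construction; CHI statistic; naive Bayes classifier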
