Conference proceedings: Advances in Information Retrieval

Chinese Text Categorization Based on the Binary Weighting Model with Non-binary Smoothing

Abstract

In Text Categorization (TC) based on the vector space model, feature weighting is vital to categorization effectiveness, and various non-binary weighting schemes are widely used for this purpose. To emphasize the category discrimination capability of features, the paper first proposes a new weighting scheme, TF*IDF*IG. Because more refined statistics are more likely to run into the sparse data problem, the paper then re-evaluates the role of the Binary Weighting Model (BWM) in TC. As a result, a novel approach named the Binary Weighting Model with Non-Binary Smoothing (BWM-NBS) is proposed to overcome the drawback of BWM. A TC system for Chinese texts that uses words as features is implemented. Experiments on a large-scale Chinese document collection of 71,674 texts show that the F1 score of BWM-NBS reaches 94.9% in the best case, which is 26.4% higher than that of TF*IDF, 19.1% higher than that of TF*IDF*IG, and 5.8% higher than that of BWM under the same conditions. Moreover, BWM-NBS exhibits strong stability in categorization performance.
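To make the weighting schemes named in the abstract concrete, the following is a minimal Python sketch of binary weighting, TF*IDF, and a TF*IDF*IG product, using common textbook definitions. The abstract does not give the paper's exact formulas, so every function name here is hypothetical, the IDF and information gain variants are assumptions, and treating TF*IDF*IG as a plain product of the three factors is one plausible reading only. The non-binary smoothing step of BWM-NBS is not described in the abstract and is deliberately not reproduced.

```python
import math
from collections import Counter


def binary_weights(doc_tokens, vocabulary):
    """Binary weighting: 1.0 if the word occurs in the document, else 0.0."""
    present = set(doc_tokens)
    return [1.0 if term in present else 0.0 for term in vocabulary]


def tf_idf_weights(doc_tokens, vocabulary, corpus):
    """TF*IDF with raw term frequency and idf = ln(N / df) (assumed variant)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = []
    for term in vocabulary:
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / df) if df else 0.0
        weights.append(tf[term] * idf)
    return weights


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(term, corpus, labels):
    """IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), textbook form."""
    with_term = [lab for doc, lab in zip(corpus, labels) if term in doc]
    without_term = [lab for doc, lab in zip(corpus, labels) if term not in doc]
    n = len(labels)
    return (entropy(labels)
            - len(with_term) / n * entropy(with_term)
            - len(without_term) / n * entropy(without_term))


def tf_idf_ig_weights(doc_tokens, vocabulary, corpus, labels):
    """Assumed reading of TF*IDF*IG: scale each TF*IDF weight by the term's IG."""
    base = tf_idf_weights(doc_tokens, vocabulary, corpus)
    return [w * information_gain(term, corpus, labels)
            for w, term in zip(base, vocabulary)]


# Toy, already word-segmented Chinese documents (illustrative data only).
corpus = [["股票", "市场", "股票"], ["足球", "比赛"], ["股票", "比赛"]]
labels = ["finance", "sports", "finance"]
vocabulary = ["股票", "市场", "足球", "比赛"]

binary_vec = binary_weights(corpus[0], vocabulary)
tfidf_vec = tf_idf_weights(corpus[0], vocabulary, corpus)
tfidf_ig_vec = tf_idf_ig_weights(corpus[0], vocabulary, corpus, labels)
```

In this toy data, the binary vector records only which words are present, while the two non-binary vectors depend on counts estimated from the collection. Those counts become unreliable when data are sparse, which is the concern the abstract raises when it revisits BWM.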
