Chinese Text Categorization Based on the Binary Weighting Model with Non-binary Smoothing


Abstract

In Text Categorization (TC) based on the vector space model, feature weighting is vital to categorization effectiveness, and various non-binary weighting schemes are widely used for this purpose. By emphasizing the category discrimination capability of features, the paper first proposes a new weighting scheme, TF*IDF*IG. Because more refined statistics are more likely to run into the sparse-data problem, we then re-evaluate the role of the Binary Weighting Model (BWM) in TC. As a result, a novel approach named the Binary Weighting Model with Non-Binary Smoothing (BWM-NBS) is proposed to overcome the drawback of BWM. A TC system for Chinese texts using words as features is implemented. Experiments on a large-scale Chinese document collection of 71,674 texts show that the F1 measure of BWM-NBS reaches 94.9% in the best case, which is 26.4% higher than that of TF*IDF, 19.1% higher than that of TF*IDF*IG, and 5.8% higher than that of BWM under the same conditions. Moreover, BWM-NBS exhibits strong stability in categorization performance.
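
To make the weighting schemes named in the abstract concrete, the following is a minimal sketch of binary weighting, TF*IDF, and TF*IDF*IG on a toy, pre-segmented corpus. The abstract does not give the paper's formulas, so everything here is an assumption: the IDF variant log(N/df), the information gain definition (category entropy minus conditional entropy given term presence), the toy documents, and all function names. BWM-NBS is not sketched because the abstract does not define its smoothing step.

```python
import math
from collections import Counter

def binary_weights(doc, vocab):
    """Binary weighting: 1.0 if the term occurs in the document, else 0.0."""
    present = set(doc)
    return [1.0 if t in present else 0.0 for t in vocab]

def tfidf_weights(doc, vocab, docs):
    """TF*IDF with the assumed variant tf * ln(N / df)."""
    n = len(docs)
    tf = Counter(doc)
    weights = []
    for t in vocab:
        df = sum(1 for d in docs if t in d)
        idf = math.log(n / df) if df else 0.0
        weights.append(tf[t] * idf)
    return weights

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(term, docs, labels):
    """IG of a term: H(category) - H(category | term present or absent)."""
    n = len(docs)
    h = entropy(list(Counter(labels).values()))
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    cond = 0.0
    for part in (with_t, without_t):
        if part:
            cond += len(part) / n * entropy(list(Counter(part).values()))
    return h - cond

def tfidf_ig_weights(doc, vocab, docs, labels):
    """TF*IDF*IG: scale each TF*IDF weight by the term's information gain."""
    base = tfidf_weights(doc, vocab, docs)
    return [w * information_gain(t, docs, labels) for w, t in zip(base, vocab)]

# Hypothetical toy corpus: each document is a list of already-segmented words.
docs = [
    ["stock", "market", "market"],
    ["stock", "price"],
    ["football", "match"],
    ["football", "market"],
]
labels = ["finance", "finance", "sports", "sports"]
vocab = sorted({t for d in docs for t in d})

print(vocab)
print(binary_weights(docs[0], vocab))
print(tfidf_weights(docs[0], vocab, docs))
print(tfidf_ig_weights(docs[0], vocab, docs, labels))
```

In this toy corpus "market" appears equally often in both categories, so its information gain is zero and TF*IDF*IG suppresses it, while binary weighting and TF*IDF still give it a non-zero weight.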


Bibliographic record

Source
Advances in Information Retrieval | 2003 | pp. 408-419 | 12 pages
Authors
Xue Dejun; Sun Maosong
Original format: PDF
Chinese Library Classification: automation technology, computer technology

Similar literature


1. Termset weighting by adapting term weighting schemes to utilize cardinality statistics for binary text categorization [J]. Badawi Dima, Altincay Hakan. Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies. 2017, No. 2

2. A logistic regression-based smoothing method for Chinese text categorization [J]. Show-Jane Yen, Yue-Shi Lee, Jia-Ching Ying. Expert Systems with Applications. 2011, No. 9

3. A medoid-based weighting scheme for nearest-neighbor decision rule toward effective text categorization [J]. Avideep Mukherjee, Tanmay Basu. SN Applied Sciences. 2020, No. 6

4. Chinese Text Categorization Based on the Binary Weighting Model with Non-binary Smoothing [C]. Xue Dejun, Sun Maosong. Lecture Notes in Computer Science 2633, European Conference on Information Retrieval Research. 2003

5. Non-binary coded modulation for FMF-based coherent optical transport networks [D]. Lin, Changyu. 2016

6. Biomedical Text Categorization Based on Ensemble Pruning and Optimized Topic Modelling [O]. Aytuğ Onan. 2018

7. A medoid-based weighting scheme for nearest-neighbor decision rule toward effective text categorization [O]. Avideep Mukherjee, Tanmay Basu. 2020
