Pacific-Asia Conference on Knowledge Discovery and Data Mining

Forgetting Word Segmentation in Chinese Text Classification with L1-Regularized Logistic Regression



Abstract

Word segmentation is commonly a preprocessing step for Chinese text representation when building a text classification system. We have found that Chinese text representations based on segmented words may lose valuable features for classification, regardless of whether the segmentation results are correct. To preserve these features, we propose using character-based N-grams to represent Chinese text in a larger-scale feature space. Considering the sparsity of the N-gram data, we adopt the L1-regularized logistic regression (L1-LR) model to classify Chinese text for better generalization and interpretability. The experimental results demonstrate that our proposed method outperforms state-of-the-art methods. Further qualitative analysis also shows that character-based N-gram representation with L1-LR is reasonable and effective for text classification.
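A minimal sketch of the approach the abstract describes, using scikit-learn; the toy corpus, labels, and hyperparameters are illustrative assumptions, not the authors' actual experimental setup. Character unigrams and bigrams replace segmented words, and the L1 penalty prunes most N-gram weights to zero in the sparse, high-dimensional feature space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus: no word segmentation -- text stays as raw character strings.
docs = ["我喜欢看足球比赛", "篮球运动员很高", "这部电影很好看", "导演拍了新电影"]
labels = ["sports", "sports", "movie", "movie"]

# Character unigrams and bigrams stand in for segmented words.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# L1 regularization drives many N-gram weights to exactly zero,
# yielding a sparse, more interpretable model.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
clf.fit(X, labels)

# Most coefficients are pruned; the survivors are the discriminative N-grams.
nonzero = int((clf.coef_ != 0).sum())
print(nonzero, clf.coef_.size)
```

In a real experiment the vocabulary of character bigrams is far larger than a segmented-word vocabulary, which is exactly why the abstract argues for L1 rather than L2 regularization: the induced sparsity keeps the model tractable and highlights which character N-grams actually drive each class.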


