Pacific-Asia Conference on Knowledge Discovery and Data Mining

Forgetting Word Segmentation in Chinese Text Classification with L1-Regularized Logistic Regression



Abstract

Word segmentation is commonly a preprocessing step for Chinese text representation when building a text classification system. We have found that Chinese text representations based on segmented words may lose valuable features for classification, regardless of whether the segmentation results are correct. To preserve these features, we propose using character-based N-grams to represent Chinese text in a larger-scale feature space. Considering the sparsity of the N-gram data, we adopt the L1-regularized logistic regression (L1-LR) model to classify Chinese text for better generalization and interpretability. The experimental results demonstrate that our proposed method outperforms state-of-the-art methods. Further qualitative analysis also shows that character-based N-gram representation with L1-LR is reasonable and effective for text classification.
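A minimal sketch of the approach the abstract describes, using scikit-learn; the toy corpus, labels, and hyperparameters are illustrative assumptions, not the authors' actual experimental setup. Character unigrams and bigrams replace segmented words, and the L1 penalty prunes most N-gram weights to zero in the sparse, high-dimensional feature space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus: no word segmentation -- text stays as raw character strings.
docs = ["我喜欢看足球比赛", "篮球运动员很高", "这部电影很好看", "导演拍了新电影"]
labels = ["sports", "sports", "movie", "movie"]

# Character unigrams and bigrams stand in for segmented words.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# L1 regularization drives many N-gram weights to exactly zero,
# yielding a sparse, more interpretable model.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
clf.fit(X, labels)

# Most coefficients are pruned; the survivors are the discriminative N-grams.
nonzero = int((clf.coef_ != 0).sum())
print(nonzero, clf.coef_.size)
```

In a real experiment the vocabulary of character bigrams is far larger than a segmented-word vocabulary, which is exactly why the abstract argues for L1 rather than L2 regularization: the induced sparsity keeps the model tractable and highlights which character N-grams actually drive each class.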


