Journal: Expert Systems with Applications

A logistic regression-based smoothing method for Chinese text categorization

Abstract

Automatic Chinese text classification is an important and well-known task in machine learning. The first step in solving a Chinese text categorization problem is usually to tokenize the unsegmented sentences into Chinese words. Previous studies typically employ a Chinese word tokenizer trained on a different source and then apply conventional text classification approaches. However, these tokenizers are not perfect and often provide incorrect word boundary information. In this paper, we propose an N-gram-based language model that takes word relations into account and performs Chinese text categorization without a Chinese word tokenizer. To handle the out-of-vocabulary problem, we also propose a novel smoothing approach based on logistic regression to improve accuracy. The experimental results show that our approach outperforms traditional methods by at least 11% in micro-averaged F-measure.
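The general idea of tokenizer-free classification with per-class N-gram language models can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the logistic regression-based smoothing, so plain add-one (Laplace) smoothing is used here as a stand-in, class priors are assumed uniform, and the names `char_ngrams` and `NgramLMClassifier` as well as the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams; no word segmentation is needed."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramLMClassifier:
    """One character n-gram language model per class (hypothetical sketch).

    Smoothing is add-one (Laplace), a stand-in for the paper's
    logistic regression-based smoothing, which the abstract does not detail.
    """

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # class -> n-gram counts
        self.context_counts = defaultdict(Counter)  # class -> (n-1)-gram context counts
        self.vocab = set()                          # characters seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.vocab.update(text)
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_likelihood(self, text, label):
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        total = 0.0
        for gram in char_ngrams(text, self.n):
            num = self.ngram_counts[label][gram] + 1
            den = self.context_counts[label][gram[:-1]] + v
            total += math.log(num / den)
        return total

    def predict(self, text):
        # Uniform class priors are assumed, so only the likelihood is compared.
        return max(self.ngram_counts,
                   key=lambda label: self.log_likelihood(text, label))


# Toy data, invented for illustration only.
train_texts = ["足球比赛很精彩", "篮球比赛结束了", "股票市场上涨", "银行利率下调"]
train_labels = ["sports", "sports", "finance", "finance"]
clf = NgramLMClassifier(n=2).fit(train_texts, train_labels)
print(clf.predict("足球比赛"))
print(clf.predict("股票上涨"))
```

Because the model scores raw character sequences, an incorrect word boundary can never be introduced, and the smoothing term keeps the probability of an unseen N-gram above zero.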
