Text Classification in Asian Languages without Word Segmentation


Abstract

We present a simple approach to Asian language text classification without word segmentation, based on statistical n-gram language modeling. In particular, we examine Chinese and Japanese text classification. By using character n-gram models, our approach avoids word segmentation. Unlike traditional ad hoc n-gram models, however, the statistical language modeling approach has a strong information-theoretic basis and avoids an explicit feature selection procedure, which can discard a significant amount of useful information. We systematically study the key factors in language modeling and their influence on classification. Experiments on Chinese TREC and Japanese NTCIR topic detection show that this simple approach achieves better performance than traditional approaches while avoiding word segmentation, which demonstrates its advantages for Asian language text classification.
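As a rough illustration of the idea described in the abstract, the sketch below trains one character n-gram language model per class and assigns a document to the class whose model gives it the highest log-probability, plus a log class prior. It is not the authors' implementation. The class name `CharNgramClassifier`, the bigram default, the start-of-text padding character, and the additive (add-alpha) smoothing are all assumptions chosen for brevity; the abstract does not say which smoothing methods or n-gram orders were used.

```python
import math
from collections import Counter, defaultdict


class CharNgramClassifier:
    """Illustrative character n-gram language-model classifier (hypothetical)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n              # n-gram order (assumed default: bigrams)
        self.alpha = alpha      # additive smoothing constant (an assumption)
        self.ngrams = defaultdict(Counter)    # class -> n-gram counts
        self.contexts = defaultdict(Counter)  # class -> (n-1)-gram context counts
        self.docs = Counter()                 # class -> number of training documents
        self.vocab = set()                    # characters seen during training

    def _pad(self, text):
        # Prepend start markers so the first characters have a full context.
        return "\x02" * (self.n - 1) + text

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.docs[label] += 1
            self.vocab.update(text)
            padded = self._pad(text)
            # Slide over raw characters: no word segmentation is needed.
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.ngrams[label][context + padded[i]] += 1
                self.contexts[label][context] += 1
        return self

    def log_score(self, text, label):
        # log P(class) + sum of log P(char | previous n-1 chars, class)
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        score = math.log(self.docs[label] / sum(self.docs.values()))
        padded = self._pad(text)
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            num = self.ngrams[label][context + padded[i]] + self.alpha
            den = self.contexts[label][context] + self.alpha * v
            score += math.log(num / den)
        return score

    def predict(self, text):
        # Choose the class whose language model best explains the text.
        return max(self.docs, key=lambda c: self.log_score(text, c))


if __name__ == "__main__":
    # Toy data invented for this example only.
    texts = ["股票市场上涨", "银行利率下调", "足球比赛开始", "篮球比赛结束"]
    labels = ["finance", "finance", "sports", "sports"]
    model = CharNgramClassifier(n=2).fit(texts, labels)
    print(model.predict("比赛"))
    print(model.predict("利率上涨"))
```

Because every character n-gram contributes to the score, no separate feature selection step is needed, which mirrors the point made in the abstract. Add-alpha smoothing is used here only because it is short; a stronger smoothing method could be substituted in `log_score` without changing the rest of the sketch.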
