Adaptive compression-based models of Chinese text

Abstract

Large-alphabet languages such as Chinese present different problems for language modelling than small-alphabet languages such as English. In this paper, we describe adaptive models of Chinese text based on the Prediction by Partial Matching (PPM) text compression scheme, which learns the language as the text is processed sequentially. We describe several character-based, word-based and part-of-speech (POS) based variants of PPM that achieve significant improvements in compression rate over existing models. Interestingly, the results for Chinese text contrast with those achieved for English text: character-based models outperform the word-based and POS-based models rather than the other way round. We then explore how well these models perform at the task of Chinese word segmentation.
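
To make "learns the language as the text is processed sequentially" concrete, the following is a minimal sketch of an adaptive character-based model in the spirit of PPM. It is not the authors' implementation: it assumes escape method C without exclusions, a fixed maximum context order, and a uniform fallback over a hypothetical alphabet of 65536 symbols. The function name `ppm_bits_per_char` and its parameters are illustrative only. It reports the ideal code length in bits per character rather than producing an actual compressed stream.

```python
import math
from collections import defaultdict


def ppm_bits_per_char(text, max_order=2, alphabet_size=65536):
    """Estimate the adaptive code length of `text` in bits per character.

    Simplified PPM-style model (assumptions: escape method C, no
    exclusions, uniform order -1 fallback over `alphabet_size` symbols).
    """
    # counts[context][symbol] = times `symbol` has followed `context` so far
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0

    for i, sym in enumerate(text):
        prob = 1.0
        coded = False
        # Try the longest available context first, then back off.
        for order in range(min(max_order, i), -1, -1):
            seen = counts.get(text[i - order:i])
            if not seen:
                continue  # context never seen: back off at no cost
            n = sum(seen.values())  # total symbol occurrences in this context
            d = len(seen)           # distinct symbols in this context
            if sym in seen:
                prob *= seen[sym] / (n + d)
                coded = True
                break
            prob *= d / (n + d)     # escape to the next shorter context
        if not coded:
            prob *= 1.0 / alphabet_size  # order -1: uniform over the alphabet
        total_bits -= math.log2(prob)

        # Adapt: update every context order only after the symbol is coded,
        # so a decoder could maintain identical statistics.
        for order in range(min(max_order, i) + 1):
            counts[text[i - order:i]][sym] += 1

    return total_bits / len(text) if text else 0.0


if __name__ == "__main__":
    sample = "中文文本压缩模型中文文本压缩模型中文文本压缩模型"
    print(f"{ppm_bits_per_char(sample):.3f} bits per character")
```

Because the counts start empty and are updated after each character, the per-character cost falls as repeated contexts accumulate. A word-based or POS-based variant would replace the character stream with a stream of word or tag symbols, which presupposes segmented or tagged input.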

Bibliographic record

Source
International Conference on Audio, Language and Image Processing, 2014, pp. 874-881 (8 pages)
Authors
Teahan William J.; Peiliang Wu; Wei Liu
Original format: PDF
Keywords
data compression; natural language processing; text analysis; Chinese text; Chinese word segmentation; English text; adaptive compression-based model; character-based variants; part-of-speech based variants; partial predictive match text compression scheme; word-based variants; Adaptation models; Context; Context modeling; Encoding; Hidden Markov models; Natural language processing; Predictive models;
