Text Classification in Asian Languages without Word Segmentation

Abstract

We present a simple approach to Asian language text classification without word segmentation, based on statistical n-gram language modeling. In particular, we examine Chinese and Japanese text classification. By using character n-gram models, our approach avoids word segmentation. Unlike traditional ad hoc n-gram models, however, the statistical language modeling approach has a strong information-theoretic basis and avoids an explicit feature selection procedure, which can lose a significant amount of useful information. We systematically study the key factors in language modeling and their influence on classification. Experiments on Chinese TREC and Japanese NTCIR topic detection show that this simple approach achieves better performance than traditional approaches while avoiding word segmentation, which demonstrates its superiority in Asian language text classification.
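
The idea in the abstract can be illustrated with a minimal sketch: train one character n-gram language model per class and assign a document to the class whose model gives it the highest likelihood. This is not the authors' implementation. The class name `CharNgramClassifier`, the add-one smoothing, the uniform class priors, and the toy training data are all assumptions made for illustration; the abstract does not say which smoothing method or n-gram order the paper uses.

```python
import math
from collections import Counter, defaultdict


class CharNgramClassifier:
    """Hypothetical sketch: per-class character n-gram models, add-one smoothing."""

    def __init__(self, n=2):
        self.n = n
        self.ngrams = defaultdict(Counter)    # class -> n-gram counts
        self.contexts = defaultdict(Counter)  # class -> (n-1)-gram context counts
        self.vocab = set()                    # all characters seen in training

    def _pad(self, text):
        # Pad with a start symbol so the first characters have a context.
        return "\x02" * (self.n - 1) + text

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            padded = self._pad(text)
            self.vocab.update(text)
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.ngrams[label][context + padded[i]] += 1
                self.contexts[label][context] += 1
        return self

    def log_prob(self, text, label):
        # Sum of log P(char | previous n-1 chars) under the class model.
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        padded = self._pad(text)
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            num = self.ngrams[label][context + padded[i]] + 1
            den = self.contexts[label][context] + v
            total += math.log(num / den)
        return total

    def predict(self, text):
        # Uniform class priors are assumed, so only the likelihood is compared.
        return max(self.ngrams, key=lambda label: self.log_prob(text, label))


# Toy data (made up for this example); no word segmentation is applied.
docs = ["股市今日上涨", "股票价格下跌", "球队赢得比赛", "球员进球得分"]
labels = ["finance", "finance", "sports", "sports"]
clf = CharNgramClassifier(n=2).fit(docs, labels)
print(clf.predict("股票上涨"))  # finance
```

Because the models operate on raw characters, the same code applies unchanged to Chinese or Japanese text. Every observed n-gram contributes to the score, so no separate feature selection step is needed.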

Bibliographic record

Source
41st Annual Meeting of the Association for Computational Linguistics: Proceedings of the Conference | 2003 | pp. 1-8 | 8 pages
Conference location: Sapporo (JP)
Authors
Fuchun Peng; Xiangji Huang; Dale Schuurmans; Shaojun Wang
Author affiliations

School of Computer Science, University of Waterloo, Ontario, Canada; Department of Computer Science, University of Massachusetts, Amherst, MA, USA;

School of Computer Science, University of Waterloo, Ontario, Canada;

School of Computer Science, University of Waterloo, Ontario, Canada;

School of Computer Science, University of Waterloo, Ontario, Canada; Department of Statistics, University of Toronto, Ontario, Canada;
Email: f3peng, jhuang, dale, sjwang@ai.uwaterloo.ca

Original format: PDF
Language: English
Chinese Library Classification: Programming languages, algorithmic languages

Similar literature

1. Word Segmentation without Any Dictionary for Full-Text Search [J]. Tomoko Tanabe, Yasuki Lizuka. Matsushita Technical Journal. 2000, No. 2

2. Improving language without words: first evidence from aphasia [J]. Marangolo P, Bonifazi S, Tomaiuolo F. Neuropsychologia. 2010, No. 13

3. Brief communication: natural language processing: word recognition without segmentation [J]. Khalid Saeed, Agnieszka Dardzinska. Journal of the American Society for Information Science and Technology. 2001, No. 14

4. Text Classification in Asian Languages without Word Segmentation [C]. Fuchun Peng, Xiangji Huang, Dale Schuurmans. 41st Annual Meeting of the Association for Computational Linguistics: Proceedings of the Conference. 2003

5. Word segmentation, word recognition, and word learning: A computational model of first language acquisition [D]. Daland, Robert. 2009

6. Text Comprehension and Oral Language as Predictors of Word-Problem Solving: Insights into Word-Problem Solving as a Form of Text Comprehension [O]. Lynn S. Fuchs, Jennifer K. Gilbert, Douglas Fuchs

7. Text Classification in Asian Languages Without Word Segmentation [O]. Peng, Fuchun, Huang, Xiangji, Schuurmans, Dale. 2003
