Association for Computational Linguistics Annual Meeting; 2007-06-23 to 2007-06-30; Prague (CZ)

Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification



Abstract

This paper addresses two remaining challenges in Chinese word segmentation. The challenge in HLT is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges. The most critical concept we introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CBs) into either word-boundaries (WBs) or non-word-boundaries. In Chinese, a CB falls between every pair of adjacent characters. Hence we can use the distributional properties of CBs among the background character strings to predict which CBs are WBs.
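The abstract's core idea — classifying each character boundary as a word boundary or not, using only distributional statistics from unsegmented text — can be illustrated with a minimal sketch. This is not the authors' actual method; it assumes a simple cohesion score (pointwise mutual information of the two characters flanking a boundary) and a hypothetical `threshold` parameter, purely to make the CB-versus-WB framing concrete:

```python
import math
from collections import Counter

def segment(text, corpus, threshold=0.0):
    """Classify each character boundary (CB) in `text` as a word
    boundary (WB) or non-WB, using character statistics from an
    unsegmented background `corpus`.

    Cohesion of a CB is the pointwise mutual information (PMI) of the
    two characters flanking it; weak cohesion (below `threshold`) is
    taken as a WB.  Both the PMI score and the threshold are
    illustrative assumptions, not the paper's method.
    """
    unigrams = Counter(corpus)
    bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    words, start = [], 0
    for i in range(1, len(text)):
        a, b = text[i - 1], text[i]
        p_ab = bigrams[a + b] / n_bi          # Counter gives 0 if unseen
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        # PMI of the flanking characters; an unseen bigram has -inf cohesion
        pmi = math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")
        if pmi < threshold:                   # weak cohesion: call this CB a WB
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words
```

For example, with a background corpus in which `ab` and `cd` recur but `bc` never occurs, the boundary between `b` and `c` gets minimal cohesion and is classified as a WB, so `"abcd"` segments into `["ab", "cd"]` — without any lexicon of "words".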
