Association for Computational Linguistics Annual Meeting; 2007-06-23 to 2007-06-30; Prague (CZ)

Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification



Abstract

This paper addresses two remaining challenges in Chinese word segmentation. The challenge in HLT is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges. The most critical concept we introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CBs) into either word-boundaries (WBs) or non-word-boundaries. In Chinese, a CB falls between every pair of adjacent characters. Hence we can use the distributional properties of CBs among the background character strings to predict which CBs are WBs.
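The abstract's core idea — classifying each character boundary as a word boundary or not, using only distributional statistics from unsegmented text — can be illustrated with a minimal sketch. This is not the authors' actual method; it assumes a simple cohesion score (pointwise mutual information of the two characters flanking a boundary) and a hypothetical `threshold` parameter, purely to make the CB-versus-WB framing concrete:

```python
import math
from collections import Counter

def segment(text, corpus, threshold=0.0):
    """Classify each character boundary (CB) in `text` as a word
    boundary (WB) or non-WB, using character statistics from an
    unsegmented background `corpus`.

    Cohesion of a CB is the pointwise mutual information (PMI) of the
    two characters flanking it; weak cohesion (below `threshold`) is
    taken as a WB.  Both the PMI score and the threshold are
    illustrative assumptions, not the paper's method.
    """
    unigrams = Counter(corpus)
    bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    words, start = [], 0
    for i in range(1, len(text)):
        a, b = text[i - 1], text[i]
        p_ab = bigrams[a + b] / n_bi          # Counter gives 0 if unseen
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        # PMI of the flanking characters; an unseen bigram has -inf cohesion
        pmi = math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")
        if pmi < threshold:                   # weak cohesion: call this CB a WB
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words
```

For example, with a background corpus in which `ab` and `cd` recur but `bc` never occurs, the boundary between `b` and `c` gets minimal cohesion and is classified as a WB, so `"abcd"` segments into `["ab", "cd"]` — without any lexicon of "words".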
