Computational Linguistics

A stochastic finite-state word-segmentation algorithm for Chinese



Abstract

The initial stage of text analysis for any NLP task usually involves the tokenization of the input into words. For languages like English one can assume, to a first approximation, that word boundaries are given by whitespace or punctuation. In various Asian languages, including Chinese, on the other hand, whitespace is never used to delimit words, so one must resort to lexical information to "reconstruct" the word-boundary information. In this paper we present a stochastic finite-state model wherein the basic workhorse is the weighted finite-state transducer. The model segments Chinese text into dictionary entries and words derived by various productive lexical processes, and--since the primary intended application of this model is to text-to-speech synthesis--provides pronunciations for these words. We evaluate the system's performance by comparing its segmentation "judgments" with the judgments of a pool of human segmenters, and the system is shown to perform quite well.
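The core idea the abstract describes--scoring candidate segmentations with lexical weights and picking the best path through a segmentation lattice--can be sketched in simplified form as dictionary lookup plus a minimum-cost dynamic program. The toy lexicon, its frequencies, and the `segment` function below are invented for illustration and are not the paper's actual transducer implementation:

```python
import math

# Hypothetical toy lexicon: word -> corpus frequency (illustrative values only).
LEXICON = {"中国": 100, "人民": 80, "中": 30, "国": 20, "人": 50, "民": 10}
TOTAL = sum(LEXICON.values())

def cost(word):
    """Weight of a lexical arc: negative log probability, as in a weighted FST."""
    return -math.log(LEXICON[word] / TOTAL)

def segment(text):
    """Return the minimum-cost segmentation of `text` by dynamic programming,
    equivalent to the cheapest path through the segmentation lattice."""
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i] = minimum cost to segment text[:i]
    back = [0] * (n + 1)          # backpointer: start index of the last word
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            word = text[j:i]
            if word in LEXICON and best[j] + cost(word) < best[i]:
                best[i] = best[j] + cost(word)
                back[i] = j
    # Recover the word sequence by following backpointers from the end.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("中国人民"))  # -> ['中国', '人民']
```

Here the two-word reading wins because its summed negative log probability (about 2.35) is lower than that of any character-by-character path; the paper's model additionally handles words produced by productive lexical processes and attaches pronunciations, which this sketch omits.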
