Journal: ACM Transactions on Asian Language Information Processing

The Left and Right Context of a Word: Overlapping Chinese Syllable Word Segmentation with Minimal Context

Abstract

Because a Chinese syllable can correspond to many characters (homophones), syllable-to-character conversion is a challenging task for Chinese phonetic input methods (CPIMs). A CPIM usually has two steps: (1) segment the syllable sequence into syllable words, and (2) select the most likely character word for each syllable word. A CPIM usually assumes that the input is a complete sentence, and its performance is evaluated on a well-formed corpus. In practice, however, most Pinyin users prefer progressive text entry in several short chunks, mostly of one or two words each (most Chinese words consist of two or more characters). Short chunks do not provide enough context for the best possible syllable-to-character conversion, especially when a chunk consists of overlapping syllable words. In such cases, a conversion system often selects the boundary of the word with the highest frequency. Short-chunk input is even more popular on platforms with limited computing power, such as mobile phones. Based on the observation that the relative strength of a word can differ considerably when calculated leftwards or rightwards, we propose a simple division of the word context into a left context and a right context. Furthermore, we design a double ranking strategy for each word to reduce the number of errors in Step 1. Our strategy is modeled as the minimum feedback arc set problem on a bipartite tournament, with approximate solutions derived from a genetic algorithm. Experiments show that, compared with the frequency-based method (FBM) (low memory and fast) and the conditional random fields (CRF) model (larger memory and slower), our double ranking strategy offers less memory and low power requirements with competitive performance. We believe a similar strategy could also be adopted to disambiguate conflicting linguistic patterns effectively.
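
The abstract gives no implementation details, so the following is only a minimal sketch of one possible reading of the idea, not the authors' method. It assumes that each syllable word gets two vertices, one for its role as the left word of an overlap ("L") and one for its role as the right word ("R"), and that an arc records which word won an overlap. The words, frequencies, and arcs are invented toy data, and every function name is hypothetical. An exhaustive search stands in for the genetic algorithm mentioned in the abstract, which is only feasible at toy sizes.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Vertex = Tuple[str, str]     # (syllable word, role): "L" = left word, "R" = right word
Arc = Tuple[Vertex, Vertex]  # (winner, loser) for one overlapping pair

# Invented toy data, for illustration only (not taken from the paper).
FREQ: Dict[str, int] = {"zhong guo": 120, "guo jia": 100, "fa zhan": 80, "zhan zhong": 5}
ARCS: List[Arc] = [
    (("guo jia", "R"), ("zhong guo", "L")),     # "zhong guo" loses as a left word
    (("zhong guo", "R"), ("zhan zhong", "L")),  # but wins as a right word
    (("fa zhan", "L"), ("zhan zhong", "R")),
]


def feedback_arcs(order: Sequence[Vertex], arcs: List[Arc]) -> int:
    """Count arcs whose winner is ranked below its loser in `order`."""
    pos = {v: i for i, v in enumerate(order)}
    return sum(1 for winner, loser in arcs if pos[winner] > pos[loser])


def best_order(arcs: List[Arc]) -> List[Vertex]:
    """Exhaustive search for an ordering with the fewest feedback arcs.

    Assumption: this brute force replaces the genetic algorithm and is
    only practical for a handful of vertices.
    """
    vertices = sorted({v for arc in arcs for v in arc})
    return list(min(permutations(vertices), key=lambda p: feedback_arcs(p, arcs)))


def pick_by_frequency(left: str, right: str) -> str:
    """Frequency-based baseline: keep the more frequent of the two words."""
    return left if FREQ[left] >= FREQ[right] else right


def pick_by_rank(left: str, right: str, order: Sequence[Vertex]) -> str:
    """Compare the left word's "L" rank with the right word's "R" rank."""
    pos = {v: i for i, v in enumerate(order)}
    return left if pos[(left, "L")] < pos[(right, "R")] else right


ORDER = best_order(ARCS)
print(pick_by_frequency("zhong guo", "guo jia"))
print(pick_by_rank("zhong guo", "guo jia", ORDER))
```

In this toy setup the frequency rule and the role-aware ranking disagree on the overlap between "zhong guo" and "guo jia", which is the kind of case the left/right split is meant to capture.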
