Conference: International Conference on Asian Language Processing
Using Mutual Information Criterion to Design an Effective Lexicon for Chinese Pinyin-to-Character Conversion

Abstract

Pinyin-to-character (P2C) conversion is mainly used to input Chinese characters into a computer. Its main difficulty is homophones, which are resolved by exploiting the contextual information provided by a lexicon and an n-gram language model (LM). Our survey of state-of-the-art P2C technologies shows that conventional optimization methods are almost all based on minimizing text perplexity, which is not directly related to P2C performance. We therefore propose a new optimization criterion, the mutual information (MI) between a text corpus and its Pinyin script, and use it to perform self-supervised word segmentation, build a lexicon, and estimate an n-gram LM, from which we build a P2C system. We implemented the P2C system using a newspaper corpus. Compared with two baseline systems, one using a handcrafted lexicon and one using a perplexity-optimized lexicon, our system achieved relative error reductions of 19.7% and 10.3%, respectively, on the test corpus. The results demonstrate the effectiveness of our proposal.
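To make the criterion concrete, the following is a minimal sketch of how an empirical mutual information I(W; Y) between lexicon units W and their Pinyin Y could be computed from aligned pairs. It is an illustration only, not the authors' algorithm: the function names `entropy` and `mutual_information` and the toy data are hypothetical, and the abstract does not say how the MI is estimated or optimized.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Empirical entropy (in bits) of a sequence of hashable items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def mutual_information(pairs):
    """Empirical I(W; Y) in bits from a list of (unit, pinyin) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    units = Counter(w for w, _ in pairs)
    pinyins = Counter(y for _, y in pairs)
    mi = 0.0
    for (w, y), c in joint.items():
        # p(w, y) * log2( p(w, y) / (p(w) * p(y)) )
        mi += (c / n) * log2(c * n / (units[w] * pinyins[y]))
    return mi


# Hypothetical toy corpus: two homophones sharing "shi", plus one unambiguous unit.
pairs = [("是", "shi"), ("事", "shi"), ("是", "shi"), ("的", "de")]

mi = mutual_information(pairs)
h_units = entropy([w for w, _ in pairs])
# H(W | Y) = H(W) - I(W; Y): the ambiguity left after the Pinyin is observed.
residual_ambiguity = h_units - mi
print(f"I(W;Y) = {mi:.4f} bits, H(W|Y) = {residual_ambiguity:.4f} bits")
```

In this toy setting each unit has exactly one Pinyin reading, so I(W; Y) equals the entropy of the Pinyin stream, and H(W | Y) measures the homophone ambiguity that a P2C system has to resolve from context. Choosing different lexicon units changes both quantities, which is presumably why the lexicon is the object being optimized.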