Conference: CCF International Conference on Natural Language Processing and Chinese Computing

Automatically Build Corpora for Chinese Spelling Check Based on the Input Method

Abstract

Chinese Spelling Check (CSC) is an important task in Chinese language processing. A main obstacle to applying supervised learning to CSC is the shortage of high-quality annotated corpora for building models. This paper proposes new approaches that automatically build CSC corpora based on input methods. We build two corpora: the p-corpus, used to check errors in text generated by the Pinyin input method, and the v-corpus, used to check errors in text generated by the voice input method. The p-corpus is constructed with two methods: one is based on conversion between Chinese characters and their sounds, and the other is based on Automatic Speech Recognition (ASR). The v-corpus is constructed with ASR. We use misspelled sentences from real-world language use as the test set. Experimental results show that our corpora yield better checking performance than the benchmark corpus.
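
The character-to-sound conversion idea can be illustrated with a minimal sketch. The abstract does not give the actual procedure, so everything here is an assumption: the sketch simulates a Pinyin input error by swapping one character for another with the same toneless Pinyin, and the names `HOMOPHONES`, `CHAR_TO_PINYIN`, and `make_training_pair` are hypothetical, as is the tiny hand-written homophone table.

```python
import random

# Hypothetical toy table: toneless Pinyin syllable -> characters that share it.
# A real pipeline would derive this from a full pronunciation lexicon.
HOMOPHONES = {
    "zai": ["在", "再"],
    "de": ["的", "得", "地"],
    "ta": ["他", "她", "它"],
}

# Reverse index: character -> its Pinyin syllable in the toy table.
CHAR_TO_PINYIN = {ch: py for py, chars in HOMOPHONES.items() for ch in chars}


def make_training_pair(sentence, rng):
    """Return (noisy, correct, position), or None if no character is confusable.

    One character is converted to its sound and back to a different
    character with the same sound, imitating a wrong candidate choice
    in a Pinyin input method.
    """
    candidates = [i for i, ch in enumerate(sentence) if ch in CHAR_TO_PINYIN]
    if not candidates:
        return None
    pos = rng.choice(candidates)
    original = sentence[pos]
    alternatives = [c for c in HOMOPHONES[CHAR_TO_PINYIN[original]] if c != original]
    replacement = rng.choice(alternatives)
    noisy = sentence[:pos] + replacement + sentence[pos + 1:]
    return noisy, sentence, pos


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(make_training_pair("他在家", rng))
```

Each returned triple pairs an erroneous sentence with its correct form and the error position, which is the kind of annotated example a supervised CSC model would need. The ASR-based construction described in the abstract is not covered by this sketch.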
