Automatically Build Corpora for Chinese Spelling Check Based on the Input Method

Abstract

Chinese Spelling Check (CSC) is an important task in Chinese language processing. A main obstacle to applying supervised learning to CSC is the shortage of high-quality annotated corpora for training models. This paper proposes new approaches that automatically build CSC corpora based on the input method. We build two corpora: the p-corpus, used to check errors in text produced by the Pinyin input method, and the v-corpus, used to check errors in text produced by the voice input method. The p-corpus is constructed using two methods: one based on conversion between Chinese characters and their pronunciations, and the other based on Automatic Speech Recognition (ASR). The v-corpus is constructed using ASR. Misspelled sentences collected from real language use serve as the test set. Experimental results show that our corpora yield better checking performance than the benchmark corpus.
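
The abstract gives no implementation detail, so the following is only a minimal sketch of the general idea behind building a Pinyin-style error corpus from character-to-sound conversion: map each character to its pronunciation, then swap one character for a different character with the same pronunciation to obtain a (noisy, correct) sentence pair. The tiny lookup table, the function names, and the single-substitution policy are all assumptions made for illustration and are not taken from the paper.

```python
import random

# Toy character-to-Pinyin table (hypothetical; tone marks omitted).
# A real pipeline would rely on a full pronunciation lexicon instead.
CHAR_TO_PINYIN = {
    "在": "zai", "再": "zai",
    "做": "zuo", "作": "zuo", "坐": "zuo",
    "他": "ta", "她": "ta", "它": "ta",
    "我": "wo", "家": "jia", "业": "ye",
}


def build_homophone_index(char_to_pinyin):
    """Group characters by Pinyin so same-sounding characters can be swapped."""
    index = {}
    for char, pinyin in char_to_pinyin.items():
        index.setdefault(pinyin, []).append(char)
    return index


def make_error_pair(sentence, homophones, char_to_pinyin, rng):
    """Return (noisy, correct, position), or None if nothing can be swapped."""
    # Only characters that have at least one other homophone are candidates.
    candidates = [
        i for i, ch in enumerate(sentence)
        if ch in char_to_pinyin and len(homophones[char_to_pinyin[ch]]) > 1
    ]
    if not candidates:
        return None
    pos = rng.choice(candidates)
    original = sentence[pos]
    options = [c for c in homophones[char_to_pinyin[original]] if c != original]
    replacement = rng.choice(options)
    noisy = sentence[:pos] + replacement + sentence[pos + 1:]
    return noisy, sentence, pos


HOMOPHONES = build_homophone_index(CHAR_TO_PINYIN)
pair = make_error_pair("我在家做作业", HOMOPHONES, CHAR_TO_PINYIN, random.Random(0))
```

Each returned pair keeps the sentence length unchanged and records the error position, which is the shape of annotation a character-level spelling checker would typically train on. The ASR-based construction mentioned in the abstract is not sketched here because it depends on a speech synthesis and recognition stack the abstract does not describe.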

Bibliographic record

Source: CCF International Conference on Natural Language Processing and Chinese Computing, 2019, pp. 471-485 (15 pages)
Authors: Jianyong Duan; Lijian Pan; Hao Wang; Mei Zhang; Mingli Wu
Format: PDF
Keywords: Corpora; Chinese spelling check; Input method
