Journal of Chinese Information Processing (中文信息学报)

A Study of Post-Processing Methods for Printed Chinese Character Recognition

     

Abstract

High-order n-gram language models, such as word-based tri-gram and four-gram models, are widely used in OCR post-processing, but their large model size causes data sparseness and high time and memory costs. This paper addresses the post-processing of printed Chinese character recognition and proposes a post-processing algorithm built on a byte-based language model. Using the byte as the basic unit of the language model greatly reduces model complexity, which in turn largely alleviates the data sparseness problem. Experiments show that a post-processing system using the byte-based language model achieves good recognition performance at very low time and space cost. On a test set containing some segmentation errors, recognition accuracy rises from 88.67% to 98.32%, an error reduction of 85.18%, and the system runs considerably faster than both character-based and word-based systems, improving the overall performance of post-processing. Compared with the commonly used word-based language model post-processing system (a traditional word-based tri-gram), the new system saves 95% of the running time and nearly 98% of the memory, while the recognition rate drops by only 1.11%.
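The idea of scoring recognition candidates with a byte-level language model can be illustrated with a minimal sketch. The abstract does not state the encoding, the n-gram order, the smoothing method, or the decoding procedure, so everything below is an assumption made for illustration: GB2312 encoding (two bytes per Chinese character), a byte bi-gram with add-one smoothing, and a Viterbi search over a lattice of per-position candidates. The names `train_byte_bigram`, `log_prob`, `inner_score`, and `decode` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

ENCODING = "gb2312"  # assumption: each Chinese character encodes to two bytes


def train_byte_bigram(corpus):
    """Count context bytes and byte bi-grams over the encoded training sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        data = sentence.encode(ENCODING)
        contexts.update(data[:-1])            # every byte that has a successor
        bigrams.update(zip(data, data[1:]))   # adjacent byte pairs
    return contexts, bigrams


def log_prob(prev, cur, contexts, bigrams, vocab=256):
    """Add-one smoothed log P(cur | prev); a byte has only 256 possible values."""
    return math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab))


def inner_score(data, contexts, bigrams):
    """Sum of byte bi-gram log-probabilities inside one encoded candidate."""
    return sum(log_prob(a, b, contexts, bigrams) for a, b in zip(data, data[1:]))


def decode(lattice, contexts, bigrams):
    """Viterbi search: choose one candidate per position to maximize the LM score.

    lattice is a list of candidate lists, one list per character position.
    OCR confidence scores are ignored here to keep the sketch short.
    """
    # best maps each candidate at the current position to (score, best path so far)
    best = {
        c: (inner_score(c.encode(ENCODING), contexts, bigrams), c)
        for c in lattice[0]
    }
    for candidates in lattice[1:]:
        new_best = {}
        for c in candidates:
            cur = c.encode(ENCODING)
            inner = inner_score(cur, contexts, bigrams)
            # Link the last byte of each predecessor to the first byte of c.
            score, path = max(
                (s + log_prob(p.encode(ENCODING)[-1], cur[0], contexts, bigrams), path)
                for p, (s, path) in best.items()
            )
            new_best[c] = (score + inner, path + c)
        best = new_best
    return max(best.values())[1]


if __name__ == "__main__":
    # Toy training corpus and a lattice of visually similar candidates.
    corpus = ["汉字识别", "汉字识别后处理", "文字识别"]
    contexts, bigrams = train_byte_bigram(corpus)
    lattice = [["汉", "仅"], ["字", "宇"], ["识", "织"], ["别", "另"]]
    print(decode(lattice, contexts, bigrams))
```

Because the model is defined over at most 256 byte values, a bi-gram table has at most 65,536 entries, which suggests why a byte-based model can be far smaller than a character-based or word-based one. A practical system would presumably also combine the language model score with the recognizer's own confidence for each candidate, which this sketch leaves out.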
