International Journal on Document Analysis and Recognition

Optical character recognition with neural networks and post-correction with finite state methods



Abstract

The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is too low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software such as Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), but so far none of these attempts has succeeded in training a mixed model that recognizes all of the data in the corpus, which would be essential for an efficient re-OCRing of the corpus. The difficulty is that the corpus is printed in the two main languages of Finland (Finnish and Swedish) and in two font families (Blackletter and Antiqua). In this paper, we explore the training of a variety of OCR models with deep neural networks (DNN). First, we find an optimal DNN for our data and, with additional training data, successfully train high-quality mixed-language models. Furthermore, we revisit the effect of confidence voting on the OCR results with different model combinations. Finally, we perform post-correction on the new OCR results and carry out an error analysis. The results show a significant boost in accuracy, resulting in 1.7% CER on the Finnish test set and 2.7% CER on the Swedish test set. The greatest accomplishment of the study is the successful training of one mixed-language model for the entire corpus and finding a voting setup that further improves the results.
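For readers unfamiliar with the metric, CER is commonly defined as the character-level edit distance between the OCR output and the ground-truth text, divided by the length of the ground truth. The sketch below illustrates that common definition only; the abstract does not say how the authors computed their figures, and the function names and sample strings are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `hyp` into `ref` (standard dynamic-programming form)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length.
    Assumes the common definition, not necessarily the paper's exact setup."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example: one substituted character in an 8-character word.
print(cer("helsinki", "hclsinki"))
```

Under this definition, a CER of 1.7% corresponds to roughly 17 character edits per 1,000 ground-truth characters.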
