Optical character recognition with neural networks and post-correction with finite state methods

Drobac Senka; Linden Krister


Abstract

The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is rather low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), but so far, none of the methods have managed to successfully train a mixed model that recognizes all of the data in the corpus, which would be essential for an efficient re-OCRing of the corpus. The difficulty lies in the fact that the corpus is printed in the two main languages of Finland (Finnish and Swedish) and in two font families (Blackletter and Antiqua). In this paper, we explore the training of a variety of OCR models with deep neural networks (DNN). First, we find an optimal DNN for our data and, with additional training data, successfully train high-quality mixed-language models. Furthermore, we revisit the effect of confidence voting on the OCR results with different model combinations. Finally, we perform post-correction on the new OCR results and perform error analysis. The results show a significant boost in accuracy, resulting in 1.7% CER on the Finnish and 2.7% CER on the Swedish test set. The greatest accomplishment of the study is the successful training of one mixed language model for the entire corpus and finding a voting setup that further improves the results.
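The abstract reports its results as character error rate (CER). As a rough illustration of what such a figure measures, the sketch below computes CER as the Levenshtein edit distance between a reference transcription and an OCR output, divided by the reference length. This is a common definition and an assumption here: the abstract does not state the exact formula or any normalization the authors apply, and the function names `levenshtein` and `cer` are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `ref` into `hyp`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate, assumed here to be edit distance
    divided by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Illustrative strings only, not data from the paper.
    print(cer("sanomalehti", "sanomalehtl"))
```

Under this definition, a CER of 1.7% corresponds to roughly 17 character edits per 1000 reference characters.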


Bibliographic details

Source
International Journal on Document Analysis and Recognition, 2020, No. 4, pp. 279-295 (17 pages)
Authors
Drobac Senka; Linden Krister
Affiliations

Univ Helsinki, Helsinki, Finland;

Univ Helsinki, Helsinki, Finland;

Original format: PDF
Language: English (eng)
Keywords
OCR; Historical periodicals; Finnish; Swedish

Record added: 2022-08-18 21:21:04

Similar literature

1. Handwritten Devanagari Character Recognition Using Layer-Wise Training of Deep Convolutional Neural Networks and Adaptive Gradient Methods [J]. Mahesh Jangid, Sumit Srivastava. Journal of Imaging. 2018, No. 2

2. Optical character recognition in real environments using neural networks and k-nearest neighbor [J]. O. Matei, P. C. Pop, H. Vălean. Applied Intelligence. 2013, No. 4

3. Optical character recognition in real environments using neural networks and k-nearest neighbor [J]. Matei O., Pop P.C., Vǎlean H. Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies. 2013, No. 4

4. Optical Character Recognition for Hangul Character using Artificial Neural Network [C]. Selly Oktaviani, Christy Atika Sari, Eko Hari Rachmawanto. International Seminar on Application for Technology of Information and Communication. 2020

5. Comparison of Search Algorithms in Two-Stage Neural Network Training for Optical Character Recognition of Handwritten Digits [D]. Gilley, Patrik Wayne. 2020

6. A Real-Time Automatic Plate Recognition System Based on Optical Character Recognition and Wireless Sensor Networks for ITS [O]. Nicole do Vale Dalarmelina, Marcio Andrey Teixeira, Rodolfo I. Meneguette. 2020

7. Optical character recognition with neural networks and post-correction with finite state methods [O]. Senka Drobac, Krister Lindén. 2020
