How to Improve Optical Character Recognition of Historical Finnish Newspapers Using Open Source Tesseract OCR Engine - Final Notes on Development and Evaluation

Abstract

This paper presents work carried out at the National Library of Finland (NLF) to improve the optical character recognition (OCR) quality of its historical Finnish newspaper collection from 1771-1910. The evaluation results reported in the paper are based mainly on a 500,000-word sample of the Finnish-language part of the collection. The sample has three parallel parts: a manually corrected ground truth (GT) version, the original OCR produced with ABBYY FineReader v. 7 or v. 8, and a version re-OCRed with ABBYY FineReader v. 11 for comparison with Tesseract's OCR. Using this sample and its original page images, we developed a re-OCRing procedure based on the open source software package Tesseract v. 3.04.01. At document level, our method initially achieved a 27.48% improvement over ABBYY FineReader 7 or 8 and a 9.16% improvement over ABBYY FineReader 11. At word level, it achieved a 36.25% improvement over ABBYY FineReader 7 or 8 and a 20.14% improvement over ABBYY FineReader 11. Our final word-level precision and recall results show a clear improvement in quality: recall is 76.0 and precision is 92.0 in comparison to GT OCR. Other measures, such as the recognizability of words with a morphological analyzer and the character accuracy rate, also show steady improvement after re-OCRing.
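As a rough illustration of what word-level precision and recall against a ground truth mean, the sketch below compares an OCR output with its GT text as bags of words. This is an assumption made for illustration only: the abstract does not describe how the authors aligned or counted words, and the function name `word_precision_recall` and the sample sentences are invented here.

```python
from collections import Counter


def word_precision_recall(ocr_text, gt_text):
    """Bag-of-words precision and recall of OCR output against ground truth.

    Illustrative only: the paper's actual evaluation procedure is not
    described in the abstract and may differ (e.g. position-aware alignment).
    """
    ocr_counts = Counter(ocr_text.split())
    gt_counts = Counter(gt_text.split())
    # Multiset intersection: a ground-truth word is matched at most as many
    # times as it occurs in both texts.
    matched = sum((ocr_counts & gt_counts).values())
    ocr_total = sum(ocr_counts.values())
    gt_total = sum(gt_counts.values())
    precision = matched / ocr_total if ocr_total else 0.0
    recall = matched / gt_total if gt_total else 0.0
    return precision, recall


# Made-up sample: the OCR output drops one word and loses a diacritic.
gt = "sanomalehti ilmestyi Helsingissä vuonna 1885"
ocr = "sanomalehti ilmestyi Helsingissa vuonna"
precision, recall = word_precision_recall(ocr, gt)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision here is the share of OCR words that also occur in the ground truth, and recall is the share of ground truth words that the OCR output recovered. A bag-of-words comparison ignores word order, so it is simpler but more lenient than an aligned comparison.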

Bibliographic details

Source
Language and technology conference | 2017 | pp. 17-30 | 14 pages
Authors
Mika Koistinen; Kimmo Kettunen; Jukka Kervinen
Original format: PDF
Keywords
Optical character recognition; Historical newspaper collections; Evaluation; Finnish
Indexed: 2022-08-26 13:58:05

Similar literature

1. Ground Truth OCR Sample Data of Finnish Historical Newspapers and Journals in Data Improvement Validation of a re-OCRing Process [J]. Kimmo Kettunen, Mika Koistinen, Jukka Kervinen. LIBER Quarterly - Journal of European Research Libraries. 2020, No. 1
2. Hybrid model for Chinese character recognition based on Tesseract-OCR [J]. International journal of internet protocol technology. 2020, No. 2
3. Hybrid model for Chinese character recognition based on Tesseract-OCR [J]. Industrial and organizational psychology. 2020, No. 2
4. Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing [C]. Mika Koistinen, Kimmo Kettunen, Tuula Paeaekkoenen. 21st Nordic Conference of Computational Linguistics. 2017
5. An Intelligent Semi-Automatic Workflow for Optical Character Recognition of Historical Printings = Ein intelligenter semi-automatischer Workflow für die OCR historischer Drucke [D]. Reul, Christian. 2020
6. The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels [O]. Robyn E. Drinkwater, Robert W. N. Cubey, Elspeth M. Haston. 2014
7. Optical Character Recognition by Open Source OCR Tool Tesseract: A Case Study [O]. Chirag Patel, Smt Chandaben Mohanbhai, Dharmendra Patel. 2012
8. Arabic Optical Character Recognition (OCR) Evaluation in Order to Develop a Post-OCR Module [R]. Kjersten, B. 2011
