Conference: Language and Technology Conference

How to Improve Optical Character Recognition of Historical Finnish Newspapers Using Open Source Tesseract OCR Engine - Final Notes on Development and Evaluation

Abstract

This paper presents work carried out at the National Library of Finland (NLF) to improve the optical character recognition (OCR) quality of the historical Finnish newspaper collection 1771-1910. The evaluation results reported in the paper are based mainly on a 500,000-word sample of the Finnish-language part of the whole collection. The sample has three parallel parts: a manually corrected ground truth (GT) version, the original OCR produced with ABBYY FineReader v. 7 or v. 8, and a version re-OCRed with ABBYY FineReader v. 11 for comparison with Tesseract's OCR. Using this sample and its original page images, we have developed a re-OCRing procedure based on the open source software package Tesseract v. 3.04.01. Our method initially achieved a 27.48% improvement over ABBYY FineReader 7 or 8 and a 9.16% improvement over ABBYY FineReader 11 at the document level. At the word level, our method achieved a 36.25% improvement over ABBYY FineReader 7 or 8 and a 20.14% improvement over ABBYY FineReader 11. Our final precision and recall results at the word level show a clear improvement in quality: recall is 76.0 and precision is 92.0 in comparison to GT OCR. Other measures, such as the recognizability of words by a morphological analyzer and the character accuracy rate, also show steady improvement after re-OCRing.
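To make the word-level precision and recall figures concrete, here is a minimal Python sketch of one way such scores could be computed against a ground truth text. It is an assumption for illustration, not the evaluation procedure used in the paper: it treats both texts as bags of whitespace-separated words and ignores word order and alignment. The function name `word_precision_recall` and the sample strings are hypothetical.

```python
from collections import Counter


def word_precision_recall(ocr_text: str, gt_text: str) -> tuple[float, float]:
    """Bag-of-words precision and recall of OCR output against ground truth.

    Illustrative simplification: words are whitespace-separated tokens and
    their positions are ignored (no alignment between the two texts).
    """
    ocr_words = Counter(ocr_text.split())
    gt_words = Counter(gt_text.split())
    # Multiset intersection: each ground truth word is matched at most once.
    matched = sum((ocr_words & gt_words).values())
    precision = matched / sum(ocr_words.values()) if ocr_words else 0.0
    recall = matched / sum(gt_words.values()) if gt_words else 0.0
    return precision, recall


# Hypothetical sample: one misrecognized word ("kanfa") and one dropped word.
gt = "Suomen kansa on aina ollut"
ocr = "Suomen kanfa on aina"
precision, recall = word_precision_recall(ocr, gt)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the four OCR words appear in the ground truth, and three of the five ground truth words are recovered. A real evaluation would normally align the OCR output with the ground truth first, so that a correct word in the wrong position is not counted as a match.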
