Conference: Digital Libraries, 2006. JCDL '06

A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books



Abstract

A number of projects are creating searchable digital libraries of printed books, including the Million Book Project, the Google Book project, and similar efforts from Yahoo and Microsoft. Content-based online book retrieval usually requires first converting printed text into machine-readable (e.g. ASCII) text with an optical character recognition (OCR) engine and then running full-text search on the results. Many of these books are old, and building an end-to-end system requires a variety of processing steps. Changing any step (including the scanning process) can affect OCR performance, so a good automatic statistical evaluation of OCR performance on book-length material is needed. Evaluating OCR performance on an entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This can be viewed as the problem of aligning two long sequences (each easily a million elements long), further complicated by OCR errors and by the possibility that large chunks of material are missing from one of the sequences. We propose a hierarchical alignment algorithm based on Hidden Markov Models (HMMs) to align OCR output with the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process breaks the problem of aligning two long sequences into the problem of aligning many smaller subsequences, which can be done rapidly and effectively. Experimental results show that our hierarchical alignment approach works very well even when the OCR output has a high recognition error rate. Finally, we use the alignment results to evaluate the performance of a commercial OCR engine over a large dataset of books.
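The idea of splitting one long alignment into many short ones can be illustrated with a minimal sketch. It is not the HMM-based method of the paper, whose details the abstract does not give. As a stand-in, it assumes that words occurring exactly once in both texts can serve as anchors, and it uses Python's `difflib.SequenceMatcher` to compare the short segments between anchors. All function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from difflib import SequenceMatcher


def longest_increasing_chain(pairs):
    """Given (i, j) pairs sorted by i, keep the longest subset with increasing j."""
    tails, tails_idx = [], []      # tails[k]: smallest last j of a chain of length k+1
    prev = [-1] * len(pairs)       # back-pointers used to rebuild the chain
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
            tails_idx.append(idx)
        else:
            tails[k] = j
            tails_idx[k] = idx
        prev[idx] = tails_idx[k - 1] if k > 0 else -1
    chain = []
    idx = tails_idx[-1] if tails_idx else -1
    while idx != -1:
        chain.append(pairs[idx])
        idx = prev[idx]
    return chain[::-1]


def unique_word_anchors(truth_words, ocr_words):
    """Anchor on words that occur exactly once in each text (an assumption
    made for this sketch, not a description of the paper's algorithm)."""
    truth_counts, ocr_counts = Counter(truth_words), Counter(ocr_words)
    ocr_pos = {w: j for j, w in enumerate(ocr_words) if ocr_counts[w] == 1}
    candidates = [(i, ocr_pos[w]) for i, w in enumerate(truth_words)
                  if truth_counts[w] == 1 and w in ocr_pos]
    return longest_increasing_chain(candidates)


def segment_errors(truth_seg, ocr_seg):
    """Return (unmatched ground-truth characters, ground-truth characters)."""
    truth_text, ocr_text = " ".join(truth_seg), " ".join(ocr_seg)
    matcher = SequenceMatcher(None, truth_text, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return len(truth_text) - matched, len(truth_text)


def estimate_char_accuracy(truth, ocr):
    """Rough character accuracy of `ocr` measured against `truth`."""
    truth_words, ocr_words = truth.split(), ocr.split()
    anchors = unique_word_anchors(truth_words, ocr_words)
    # Sentinel anchors mark the start and end of both texts.
    bounds = [(-1, -1)] + anchors + [(len(truth_words), len(ocr_words))]
    errors = total = 0
    for (i0, j0), (i1, j1) in zip(bounds, bounds[1:]):
        # Align only the words strictly between two neighbouring anchors.
        e, n = segment_errors(truth_words[i0 + 1:i1], ocr_words[j0 + 1:j1])
        errors += e
        total += n
    # Anchor words match exactly, so they add characters but no errors.
    total += sum(len(truth_words[i]) for i, _ in anchors)
    return 1.0 - errors / total if total else 1.0


if __name__ == "__main__":
    truth = "the quick brown fox jumps over the lazy dog"
    ocr = "the qu1ck brown f0x jumps over the lazy dog"
    print(unique_word_anchors(truth.split(), ocr.split()))
    print(round(estimate_char_accuracy(truth, ocr), 3))
```

Because each segment between anchors is short, the per-segment comparison stays cheap even when the full texts are very long, and a large missing chunk only affects the segment it falls in. A real system would need a more robust anchor model than exact unique-word matches, since OCR errors can destroy or create apparently unique words.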
