A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books

Abstract

A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based online book retrieval usually requires first converting printed text into machine-readable (e.g., ASCII) text using an optical character recognition (OCR) engine and then doing full-text search on the results. Many of these books are old, and a variety of processing steps are required to create an end-to-end system. Changing any step (including the scanning process) can affect OCR performance, so a good automatic statistical evaluation of OCR performance on book-length material is needed. Evaluating OCR performance on an entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be done rapidly and effectively. Experimental results show that our hierarchical alignment approach works very well even if OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results.
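
The divide-and-conquer idea in the abstract (split one long alignment into many short ones, then score the OCR output) can be illustrated with a minimal sketch. It is not the authors' HMM-based method. As a stand-in, it anchors on words that occur exactly once in both sequences, keeps the longest order-preserving chain of those anchors, and aligns the short segments between anchors with Python's `difflib.SequenceMatcher`. The function names `unique_anchors`, `split_at_anchors`, and `word_accuracy` are hypothetical, and word-level accuracy is an assumed metric chosen for illustration.

```python
from bisect import bisect_left
from collections import Counter
from difflib import SequenceMatcher


def unique_anchors(ocr_words, truth_words):
    """Return (i, j) index pairs of words that occur exactly once in each
    sequence, reduced to the longest chain increasing in both i and j."""
    ocr_counts = Counter(ocr_words)
    truth_counts = Counter(truth_words)
    truth_pos = {w: j for j, w in enumerate(truth_words) if truth_counts[w] == 1}
    pairs = [(i, truth_pos[w]) for i, w in enumerate(ocr_words)
             if ocr_counts[w] == 1 and w in truth_pos]

    # Longest increasing subsequence on j (pairs are already sorted by i).
    tails, tails_idx = [], []      # smallest chain-ending j per length, and its pair index
    prev = [-1] * len(pairs)       # back-pointers for chain reconstruction
    for n, (_, j) in enumerate(pairs):
        k = bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
            tails_idx.append(n)
        else:
            tails[k] = j
            tails_idx[k] = n
        prev[n] = tails_idx[k - 1] if k > 0 else -1

    chain = []
    n = tails_idx[-1] if tails_idx else -1
    while n != -1:
        chain.append(pairs[n])
        n = prev[n]
    return chain[::-1]


def split_at_anchors(ocr_words, truth_words):
    """Break the two long sequences into short (ocr, truth) segment pairs."""
    segments = []
    pi = pj = 0
    for i, j in unique_anchors(ocr_words, truth_words):
        segments.append((ocr_words[pi:i], truth_words[pj:j]))  # gap before anchor
        segments.append(([ocr_words[i]], [truth_words[j]]))    # the anchor itself
        pi, pj = i + 1, j + 1
    segments.append((ocr_words[pi:], truth_words[pj:]))        # trailing gap
    return segments


def word_accuracy(ocr_words, truth_words):
    """Fraction of ground-truth words matched after segment-wise alignment."""
    if not truth_words:
        return 1.0
    matched = 0
    for ocr_seg, truth_seg in split_at_anchors(ocr_words, truth_words):
        sm = SequenceMatcher(None, ocr_seg, truth_seg, autojunk=False)
        matched += sum(block.size for block in sm.get_matching_blocks())
    return matched / len(truth_words)


if __name__ == "__main__":
    truth = "the quick brown fox jumps over the lazy dog".split()
    ocr = "the qu1ck brown fox jumps ovcr the lazy dog".split()
    print(f"word accuracy: {word_accuracy(ocr, truth):.3f}")
```

Because each gap between anchors is aligned independently, a large missing chunk in one sequence only affects the segment it falls in rather than the whole alignment.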

Bibliographic details

Source
Digital Libraries, 2006. JCDL '06 | 2006 | pp. 109-118 | 10 pages
Authors
Shaolei Feng; R. Manmatha
Original format: PDF
Chinese Library Classification: electronic libraries; digital libraries
Keywords
digital libraries

Similar literature

1. Making digital libraries effective: Automatic generation of links for similarity search across hyper-textbooks [J]. Massimo Melucci. Journal of the American Society for Information Science and Technology. 2004, no. 5
2. Performance evaluation of table type RFID reader for library automatic book identification [J]. Kiyotaka Fujisaki. International Journal of Web Information Systems. 2020, no. 1
3. Evaluating hierarchical organisation structures for exploring digital libraries [J]. A. Squassabia. Computing Reviews. 2015, no. 5
4. A Hierarchical, HMM-based Automatic Evaluation of OCR Accuracy for a Digital Library of Books [C]. Shaolei Feng, R. Manmatha. ACM/IEEE-CS Joint Conference on Digital Libraries. 2006
5. A multimodal fusion approach for automatic postal address recognition system using Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) techniques [D]. Singh, Amriteshwar. 2011
6. The Digital Librarians Legal Handbook: Powerful Concise Insight into Intellectual Property Rights in 21st-Century Digital Library Collections [O]. Gloria Kroc. 2014
7. A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books [O]. Shaolei Feng, R. Manmatha. 2010
8. Digital Talking Book Distribution Analysis: Audio Book Distribution System Design Submitted to the Library of Congress, National Library Service for the Blind and Physically Handicapped for Digital Talking Book Distribution Analysis Task 4: Transition [R]. 2006
