Model-based restoration of document images for OCR

Abstract

This paper presents a methodology for model-based restoration of degraded document imagery. The methodology can adapt to nonuniform page degradations, and it is based on a model of image defects that is estimated directly from a set of calibrating degraded document images. Further, unlike global filtering schemes, our methodology filters only words that, with high probability, have been misspelled by the OCR. In the first stage of the process, we extract a training sample of candidate misspelled word subimages from the set of calibration images before and after the degradation that we wish to undo. These word subimages are registered to extract defect pixels. The second stage uses a vector-quantization-based algorithm to construct a summary model of the defect pixels. The final stage uses the summary model to restore degraded document images. We evaluate the performance of the methodology for a variety of parameter settings on a real-world sample of degraded fax-transmitted documents. In our sample experiment, the methodology eliminates up to 56.4% of the OCR character errors introduced by fax transmission.
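The abstract does not say how the defect pixels are represented or how the codebook is designed, so the following is only a minimal sketch of the first two stages under stated assumptions: the clean and degraded word subimages are already registered and binary, defect pixels are taken to be the pixels where the two differ, each defect pixel is described by its 3x3 neighborhood in the degraded image, and the summary model is a codebook trained with the generalized Lloyd algorithm. The function names (`defect_mask`, `windows_at_defects`, `train_codebook`), the window size, and the squared-error distance are all illustrative choices, not the authors' method.

```python
import random

def defect_mask(clean, degraded):
    """Mark pixels that differ between two registered binary images (rows of 0/1)."""
    return [[c ^ d for c, d in zip(crow, drow)]
            for crow, drow in zip(clean, degraded)]

def windows_at_defects(degraded, mask, radius=1):
    """Collect the (2*radius+1)^2 neighborhood of the degraded image around
    every defect pixel; pixels outside the image are treated as background (0)."""
    h, w = len(degraded), len(degraded[0])
    vectors = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            vec = []
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    inside = 0 <= yy < h and 0 <= xx < w
                    vec.append(degraded[yy][xx] if inside else 0)
            vectors.append(vec)
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest(codebook, vec):
    """Index of the codeword closest to vec."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], vec))

def train_codebook(vectors, k, iterations=20, seed=0):
    """Generalized Lloyd algorithm: alternate nearest-codeword assignment
    and centroid update. Requires k <= len(vectors)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        cells = [[] for _ in range(k)]
        for v in vectors:
            cells[nearest(codebook, v)].append(v)
        for i, cell in enumerate(cells):
            if cell:  # keep the old codeword if its cell is empty
                codebook[i] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook

# Toy 4x4 word subimage before and after a hypothetical degradation.
clean = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
degraded = [[0, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 1, 1, 1],
            [0, 0, 0, 0]]

mask = defect_mask(clean, degraded)
vectors = windows_at_defects(degraded, mask)
codebook = train_codebook(vectors, k=2)
print(len(vectors), "defect neighborhoods,", len(codebook), "codewords")
```

A restoration stage would then need some rule that maps each codeword to a corrective action on the degraded image; the abstract gives no detail on that step, so it is left out here.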


Bibliographic details

Source
Document Recognition III, 1996, pp. 297-308 (12 pages)
Authors
Mysore Y. Jaisimha (MathSoft, Inc., Seattle, WA, USA); Eve A. Riskin (Univ. of Washington, Seattle, WA, USA); Richard Ladner (Univ. of Washington, Seattle, WA, USA); Werner Stuetzle (Univ. of Washington, Seattle, WA, USA)
Original format: PDF
CLC classification: Automation technology, computer technology

