Published in: Document Recognition III (conference proceedings)

Model-based restoration of document images for OCR

Abstract

This paper presents a methodology for model-based restoration of degraded document images. The methodology can adapt to nonuniform page degradations, and it is based on a model of image defects that is estimated directly from a set of calibrating degraded document images. Further, unlike global filtering schemes, it filters only those words that the OCR has, with high probability, misspelled. In the first stage, we extract a training sample of candidate misspelled word subimages from the set of calibration images, taken before and after the degradation that we wish to undo. These word subimages are registered to extract defect pixels. The second stage uses a vector-quantization-based algorithm to construct a summary model of the defect pixels. The final stage uses the summary model to restore degraded document images. We evaluate the performance of the methodology for a variety of parameter settings on a real-world sample of degraded FAX-transmitted documents. In our sample experiment, the methodology eliminates up to 56.4% of the OCR character errors introduced by FAX transmission.
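The three stages can be illustrated with a toy sketch. The abstract does not specify the defect representation, the quantizer, or the restoration rule, so everything below is an assumption made for illustration: images are small binary grids (1 = ink), a defect is described by the 3x3 degraded neighborhood around it, the codebook is built with a simple k-modes quantizer (Lloyd-style iterations with Hamming distance and majority-vote centroids), and restoration flips any pixel whose neighborhood lies within `max_distance` of a codeword. All function and parameter names are hypothetical and not taken from the paper.

```python
def neighborhood(img, r, c):
    """3x3 neighborhood of pixel (r, c) as a tuple, padded with 0 (white)."""
    h, w = len(img), len(img[0])
    return tuple(
        img[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0
        for rr in range(r - 1, r + 2)
        for cc in range(c - 1, c + 2)
    )


def hamming(a, b):
    """Number of positions where two equal-length binary tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def extract_defects(clean, degraded):
    """Stage 1 (assumed form): compare two already-registered word images
    and collect the degraded neighborhood around every differing pixel."""
    return [
        neighborhood(degraded, r, c)
        for r in range(len(clean))
        for c in range(len(clean[0]))
        if clean[r][c] != degraded[r][c]
    ]


def build_codebook(samples, k, iterations=10):
    """Stage 2 (assumed form): k-modes vector quantization of the defect
    neighborhoods. Initialization uses the first k distinct samples."""
    codebook = list(dict.fromkeys(samples))[:k]
    for _ in range(iterations):
        clusters = [[] for _ in codebook]
        for s in samples:
            nearest = min(range(len(codebook)),
                          key=lambda i: hamming(s, codebook[i]))
            clusters[nearest].append(s)
        updated = []
        for members, old in zip(clusters, codebook):
            if not members:
                updated.append(old)  # keep an empty cluster's codeword
                continue
            # Component-wise majority vote over the cluster members.
            updated.append(tuple(int(2 * sum(col) >= len(members))
                                 for col in zip(*members)))
        if updated == codebook:
            break
        codebook = updated
    return codebook


def restore(degraded, codebook, max_distance=0):
    """Stage 3 (assumed form): flip each pixel whose degraded neighborhood
    is within max_distance of some codeword."""
    restored = [row[:] for row in degraded]
    for r in range(len(degraded)):
        for c in range(len(degraded[0])):
            patch = neighborhood(degraded, r, c)
            if any(hamming(patch, cw) <= max_distance for cw in codebook):
                restored[r][c] = 1 - degraded[r][c]
    return restored


# Calibration pair: a horizontal stroke with one dropped pixel.
clean = [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]]
degraded = [[0, 0, 0, 0, 0], [1, 1, 0, 1, 1], [0, 0, 0, 0, 0]]

codebook = build_codebook(extract_defects(clean, degraded), k=1)

# A different word image with the same kind of defect.
new_word = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 0, 1, 1], [0, 0, 0, 0, 0, 0]]
repaired = restore(new_word, codebook)
```

In this sketch the codebook learned from one calibration pair fills the same kind of one-pixel gap in a word image it has not seen. Following the abstract, such a filter would be applied only to word subimages that the OCR has likely misspelled, not to the whole page; that selection step is omitted here.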
