Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence

An automatic closed-loop methodology for generating character groundtruth for scanned documents

Abstract

Character groundtruth for real, scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not practical because (i) accuracy in delineating groundtruth character bounding boxes is not high enough, (ii) it is extremely laborious and time-consuming, and (iii) the manual labor required for this task is prohibitively expensive. We describe a closed-loop methodology for collecting very accurate groundtruth for scanned documents. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The ideal document is then printed, photocopied, and scanned. A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register the ideal document image to the scanned document image. Finally, the groundtruth associated with the ideal document image is transformed using the estimated geometric transformation to create the groundtruth for the scanned document image. This methodology is very general and can be used for creating groundtruth for documents typeset in any language, layout, font, and style. We have demonstrated the method by generating groundtruth for English, Hindi, and FAX document images. The cost of creating groundtruth using our methodology is minimal. If character, word, or zone groundtruth is available for any real document, the registration algorithm can be used to generate the corresponding groundtruth for a rescanned version of the document.
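
The final step of the abstract, carrying groundtruth boxes from the ideal image to the scanned image through an estimated global transformation, can be illustrated with a small sketch. The abstract does not say which transformation model or estimation procedure the authors use, so this example assumes a 2D affine model fitted by least squares to a few matched points. The function names `estimate_affine` and `transform_box` and the control points are hypothetical, and the robust local bitmap matching step is not shown.

```python
import numpy as np


def estimate_affine(src, dst):
    """Least-squares 2D affine transform mapping src points onto dst points.

    Assumption: an affine model stands in for the paper's global
    geometric transformation, which the abstract does not specify.
    Returns a 3x2 matrix M such that [x, y, 1] @ M approximates [x', y'].
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Homogeneous coordinates: each row is [x, y, 1].
    A = np.hstack([src, np.ones((len(src), 1))])
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return M


def transform_box(box, M):
    """Map an axis-aligned box (x0, y0, x1, y1) through M.

    All four corners are transformed and the enclosing axis-aligned
    box is returned, so rotation or skew cannot clip the character.
    """
    x0, y0, x1, y1 = box
    corners = np.array(
        [[x0, y0, 1], [x1, y0, 1], [x1, y1, 1], [x0, y1, 1]], dtype=float
    )
    mapped = corners @ M
    return (
        mapped[:, 0].min(),
        mapped[:, 1].min(),
        mapped[:, 0].max(),
        mapped[:, 1].max(),
    )


# Hypothetical matched points: ideal image -> scanned image.
# Here the scan is the ideal page scaled by 2 and shifted by (10, 5).
ideal_pts = [(0, 0), (100, 0), (0, 100), (100, 100)]
scanned_pts = [(10, 5), (210, 5), (10, 205), (210, 205)]

M = estimate_affine(ideal_pts, scanned_pts)

# A groundtruth character box in the ideal image, mapped to the scan.
scanned_box = transform_box((10, 20, 30, 40), M)
print(scanned_box)  # close to (30, 45, 70, 85), up to floating-point error
```

Transforming all four corners and taking their extent keeps the output box axis-aligned, which is a design choice of this sketch rather than something stated in the abstract.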