Conference: IEEE Winter Conference on Applications of Computer Vision

DeepErase: Weakly Supervised Ink Artifact Removal in Document Text Images

Abstract

Paper-intensive industries like insurance, law, and government have long leveraged optical character recognition (OCR) to automatically transcribe hordes of scanned documents into text strings for downstream processing. Even in 2019, there are still many scanned documents and mail that come into businesses in non-digital format. Text to be extracted from real world documents is often nestled inside rich formatting, such as tabular structures or forms with fill-in-the-blank boxes or underlines whose ink often touches or even strikes through the ink of the text itself. Further, the text region could have random ink smudges or spurious strokes. Such ink artifacts can severely interfere with the performance of recognition algorithms or other downstream processing tasks. In this work, we propose DeepErase, a neural-based preprocessor to erase ink artifacts from text images. We devise a method to programmatically assemble real text images and real artifacts into realistic-looking "dirty" text images, and use them to train an artifact segmentation network in a weakly supervised manner, since pixel-level annotations are automatically obtained during the assembly process. In addition to high segmentation accuracy, we show that our cleansed images achieve a significant boost in recognition accuracy by popular OCR software such as Tesseract 4.0. Finally, we test DeepErase on out-of-distribution datasets (NIST SDB) of scanned IRS tax return forms and achieve double-digit improvements in accuracy. All experiments are performed on both printed and handwritten text.
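
The assembly step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes grayscale images with dark ink on a light background, and the function name `assemble_dirty_image`, the `ink_threshold` parameter, and the pixel-wise-minimum compositing rule are all hypothetical choices made for illustration. The point it shows is that the artifact mask is known by construction, so no manual pixel-level labeling is needed.

```python
import numpy as np

def assemble_dirty_image(text_img, artifact_img, ink_threshold=128):
    """Overlay an artifact on a clean text image and return (dirty, mask).

    Hypothetical helper: assumes uint8 grayscale arrays of equal shape,
    with dark ink (low values) on light paper (high values).
    """
    if text_img.shape != artifact_img.shape:
        raise ValueError("text and artifact images must have the same shape")
    # Where strokes overlap, the darker pixel wins, mimicking ink on ink.
    dirty = np.minimum(text_img, artifact_img)
    # The artifact's ink pixels are known in advance, so the segmentation
    # label is obtained for free (assumed threshold, not from the paper).
    mask = (artifact_img < ink_threshold).astype(np.uint8)
    return dirty, mask

# Toy example: a vertical text stroke crossed by a horizontal line.
text = np.full((5, 8), 255, dtype=np.uint8)
text[1:4, 2] = 0
artifact = np.full((5, 8), 255, dtype=np.uint8)
artifact[2, :] = 0
dirty, mask = assemble_dirty_image(text, artifact)
```

In this sketch a pixel where text and artifact overlap is labeled as artifact; whether such pixels should count as text or artifact is a design decision the abstract does not specify.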
