DeepErase: Weakly Supervised Ink Artifact Removal in Document Text Images


Abstract

Paper-intensive industries like insurance, law, and government have long leveraged optical character recognition (OCR) to automatically transcribe hordes of scanned documents into text strings for downstream processing. Even in 2019, there are still many scanned documents and mail that come into businesses in non-digital format. Text to be extracted from real world documents is often nestled inside rich formatting, such as tabular structures or forms with fill-in-the-blank boxes or underlines whose ink often touches or even strikes through the ink of the text itself. Further, the text region could have random ink smudges or spurious strokes. Such ink artifacts can severely interfere with the performance of recognition algorithms or other downstream processing tasks. In this work, we propose DeepErase, a neural-based preprocessor to erase ink artifacts from text images. We devise a method to programmatically assemble real text images and real artifacts into realistic-looking "dirty" text images, and use them to train an artifact segmentation network in a weakly supervised manner, since pixel-level annotations are automatically obtained during the assembly process. In addition to high segmentation accuracy, we show that our cleansed images achieve a significant boost in recognition accuracy by popular OCR software such as Tesseract 4.0. Finally, we test DeepErase on out-of-distribution datasets (NIST SDB) of scanned IRS tax return forms and achieve double-digit improvements in accuracy. All experiments are performed on both printed and handwritten text.
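The assembly step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the pixelwise-minimum compositing rule, the ink threshold, and the names `assemble_dirty_image` and `erase` are all assumptions chosen for illustration. It only shows why overlaying a known artifact on a clean text image yields a pixel-level artifact mask without any manual labeling.

```python
# Illustrative sketch only: the threshold, the min-compositing rule, and all
# names here are assumptions, not details taken from the paper.

WHITE = 255
INK_THRESHOLD = 128  # assumed cutoff: pixels darker than this count as ink


def assemble_dirty_image(text_img, artifact_img):
    """Overlay an artifact on a clean text image.

    Both inputs are equal-sized grayscale images stored as lists of rows,
    with dark ink (low values) on a white background (255).
    Returns (dirty_img, artifact_mask); the mask is 1 wherever the artifact
    contributed ink, so it comes for free from the assembly itself.
    """
    if len(text_img) != len(artifact_img) or any(
        len(t_row) != len(a_row) for t_row, a_row in zip(text_img, artifact_img)
    ):
        raise ValueError("text and artifact images must have the same shape")

    dirty, mask = [], []
    for t_row, a_row in zip(text_img, artifact_img):
        # The darker pixel wins, mimicking ink layered on paper.
        dirty.append([min(t, a) for t, a in zip(t_row, a_row)])
        mask.append([1 if a < INK_THRESHOLD else 0 for a in a_row])
    return dirty, mask


def erase(dirty_img, artifact_mask):
    """Whiten every pixel that the mask marks as artifact."""
    return [
        [WHITE if m else p for p, m in zip(d_row, m_row)]
        for d_row, m_row in zip(dirty_img, artifact_mask)
    ]


# A vertical text stroke and an underline that crosses it.
text = [
    [255, 0, 255, 255],
    [255, 0, 255, 255],
    [255, 0, 255, 255],
]
underline = [
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [0, 0, 0, 0],
]
dirty, mask = assemble_dirty_image(text, underline)
cleaned = erase(dirty, mask)
```

In this toy version, erasing with the mask also whitens the text pixel that the underline overlaps in the last row. A real system would use the mask predicted by the trained segmentation network rather than the ground-truth mask, and how overlapping pixels are handled is a design choice this sketch does not settle.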


Bibliographic details

Source
IEEE Winter Conference on Applications of Computer Vision | 2020 | pp. 3511-3519 | 9 pages
Authors
Yike Qi; W. Ronny Huang; Qianqian Li; Jonathan L. DeGange
Original format: PDF
Keywords
Task analysis; Image segmentation; Ink; Optical character recognition software; Text recognition; NIST; Image recognition


Similar literature


1. Weakly supervised precise segmentation for historical document images [J]. Xie Zecheng, Huang Yaoxiong, Jin Lianwen. Neurocomputing. 2019, Issue Jul 20

2. AUTOMATIC TEXT EXTRACTION, REMOVAL AND INPAINTING OF COMPLEX DOCUMENT IMAGES [J]. Yen-Lin Chen. International Journal of Innovative Computing Information and Control. 2012, Issue 1A

3. Weakly Supervised Text Attention Network for Generating Text Proposals in Scene Images [C]. Li Rong, En MengYi, Li JianQiang. IAPR International Conference on Document Analysis and Recognition. 2017

4. Enhancement and artifact removal for transform coded document images [D]. Wong, Tak Shing. 2011

5. A clinical text classification paradigm using weak supervision and deep representation [O]. Yanshan Wang, Sunghwan Sohn, Sijia Liu. 2019

6. Curved Text Detection in Natural Scene Images with Semi- and Weakly-Supervised Learning [O]. Xugong Qin, Yu Zhou, Dongbao Yang. 2019
