Autonomous cleaning of corrupted scanned documents — A generative modeling approach

Abstract

We study the task of cleaning scanned text documents that are strongly corrupted by dirt such as manual line strokes or spilled ink. We aim to autonomously remove dirt from a single letter-size page based only on the information the page contains. Our approach therefore has to learn character representations without supervision, and it requires a mechanism to distinguish learned representations from irregular patterns. To learn character representations, we use a probabilistic generative model parameterizing pattern features, feature variances, the features' planar arrangements, and pattern frequencies. The latent variables of the model describe pattern class, pattern position, and the presence or absence of individual pattern features. The model parameters are optimized using a novel variational EM approximation. After learning, the parameters represent planar feature arrangements and their variances independently of their absolute position. A quality measure defined on the learned representation then allows autonomous discrimination between regular character patterns and the irregular patterns making up the dirt. The irregular patterns can thus be removed to clean the document. For a full Latin alphabet, we found that a single page does not contain sufficiently many character examples. However, we show that a page containing a lower number of character types, even if heavily corrupted by dirt, can be cleaned efficiently and autonomously based solely on the structural regularity of the characters it contains. In several examples using characters from different alphabets, we demonstrate the generality of the approach and discuss its implications for future developments.
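
The general idea in the abstract (learn regular patterns without supervision, then score each pattern by how well the learned model explains it) can be illustrated with a much simpler stand-in. The sketch below is not the authors' model: it assumes fixed-size binary patches, a plain Bernoulli mixture fitted with standard EM (no position invariance, no feature-presence latents, no variational approximation), and uses the per-patch log-likelihood as a hypothetical quality measure. The function names, the bar-shaped toy templates, and the 10th-percentile threshold are all illustrative assumptions.

```python
import numpy as np

def fit_bernoulli_mixture(X, n_classes, n_iter=50, seed=0):
    """Standard EM for a mixture of independent Bernoulli pixels (toy stand-in)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialise class templates from random patches, softened away from 0 and 1.
    mu = 0.25 + 0.5 * X[rng.choice(n, n_classes, replace=False)]
    pi = np.full(n_classes, 1.0 / n_classes)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each class for each patch.
        log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate class frequencies and pixel probabilities.
        nk = r.sum(axis=0) + 1e-9
        mu = np.clip((r.T @ X) / nk[:, None], 1e-3, 1 - 1e-3)
        pi = nk / nk.sum()
    return mu, pi

def quality(X, mu, pi):
    """Per-patch log-likelihood under the fitted mixture (hypothetical quality score)."""
    log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T + np.log(pi)
    m = log_p.max(axis=1)
    return m + np.log(np.exp(log_p - m[:, None]).sum(axis=1))

# Toy data: two 5x5 "characters" (vertical and horizontal bar) with 5% pixel flips,
# followed by 20 random patches that play the role of dirt.
rng = np.random.default_rng(0)
bar_v = np.zeros((5, 5))
bar_v[:, 2] = 1
bar_h = np.zeros((5, 5))
bar_h[2, :] = 1
templates = np.stack([bar_v.ravel(), bar_h.ravel()])
clean = templates[rng.integers(0, 2, size=200)]
flips = rng.random(clean.shape) < 0.05
regular = np.where(flips, 1 - clean, clean)
dirt = (rng.random((20, 25)) < 0.5).astype(float)
patches = np.vstack([regular, dirt])

mu, pi = fit_bernoulli_mixture(patches, n_classes=2)
quality_scores = quality(patches, mu, pi)

# Flag the lowest-scoring patches as irregular; the cut-off is an arbitrary assumption.
threshold = np.percentile(quality_scores, 10)
is_dirt = quality_scores < threshold
```

In this toy setting the random patches tend to receive much lower scores than the noisy bars, which is the property a quality-based cleaning step relies on. A real page would additionally require the position-invariant representation described in the abstract.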

Bibliographic details

Source
Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on | 2012 | pp. 3338-3345 | 8 pages
Conference location: Providence, RI (US)
Authors
Dai, Zhenwen; Lücke, Jörg
Author affiliation

Frankfurt Institute for Advanced Studies, Dept. of Physics, Goethe-University Frankfurt;

Original format: PDF
Language: English
Chinese Library Classification: TP391.41

Similar documents

1. Autonomous Document Cleaning—A Generative Approach to Reconstruct Strongly Corrupted Scanned Texts [J]. Dai Z., Lücke J. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2014, no. 10
2. Label Correlation Mixture Model: A Supervised Generative Approach to Multilabel Spoken Document Categorization [J]. He Zhiyang, Wu Ji, Li Tao. Emerging Topics in Computing, IEEE Transactions on. 2015, no. 2
3. Spectral quality control in motion-corrupted single-voxel J-difference editing scans: an interleaved navigator approach [J]. Bhattacharyya PK, Lowe MJ, Phillips MD. Magnetic resonance in medicine: official journal of the Society of Magnetic Resonance in Medicine. 2007, no. 4
4. Autonomous cleaning of corrupted scanned documents — A generative modeling approach [C]. Dai Zhenwen, Lücke Jörg. IEEE Conference on Computer Vision and Pattern Recognition. 2012
5. Generative Poisoning Attacks on Neural Network Models in Autonomous Driving [D]. Abdullah, Syed Hassan. 2021
6. A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models [O]. Dharitri Misra, Siyuan Chen, George R. Thoma
7. Autonomous Cleaning of Corrupted Scanned Documents - A Generative Modeling Approach [O]. Dai, Zhenwen, Lücke, Jörg. 2012
