Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence

Autonomous Document Cleaning—A Generative Approach to Reconstruct Strongly Corrupted Scanned Texts


Abstract

We study the task of cleaning scanned text documents that are strongly corrupted by dirt such as manual line strokes, spilled ink, etc. We aim at autonomously removing such corruptions from a single letter-size page based only on the information the page contains. Our approach first learns character representations from document patches without supervision. For learning, we use a probabilistic generative model parameterizing pattern features, their planar arrangements and their variances. The model’s latent variables describe pattern position and class, and feature occurrences. Model parameters are efficiently inferred using a truncated variational EM approach. Based on the learned representation, a clean document can be recovered by identifying, for each patch, pattern class and position while a quality measure allows for discrimination between character and non-character patterns. For a full Latin alphabet we found that a single page does not contain sufficiently many character examples. However, even if heavily corrupted by dirt, we show that a page containing a lower number of character types can efficiently and autonomously be cleaned solely based on the structural regularity of the characters it contains. In different example applications with different alphabets, we demonstrate and discuss the effectiveness, efficiency and generality of the approach.
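The abstract mentions truncated variational EM but gives no equations, so the following is only a rough sketch of the general idea, not the authors' model. It assumes a much simpler stand-in, an isotropic Gaussian mixture over patch vectors, and it ignores pattern position and feature occurrences entirely. The function names, the shared variance `var`, and the truncation size `K` are all hypothetical. The truncation step keeps only the `K` most probable classes for each data point and renormalises the posterior over them.

```python
import numpy as np

def truncated_e_step(Y, means, var, log_prior, K):
    """Approximate posteriors that keep only the K best classes per data point.

    Assumption: isotropic Gaussian mixture with one shared variance `var`.
    This is an illustrative stand-in, not the model from the paper.
    """
    # Squared distance from every data point to every class mean: shape (N, C)
    d2 = ((Y[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    # Log joint up to a constant that is identical for all classes
    log_joint = log_prior[None, :] - 0.5 * d2 / var
    # Indices of the K highest-scoring classes for each data point
    top = np.argpartition(-log_joint, K - 1, axis=1)[:, :K]
    mask = np.zeros_like(log_joint, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    # Renormalise over the retained classes only; all others get zero mass
    masked = np.where(mask, log_joint, -np.inf)
    masked -= masked.max(axis=1, keepdims=True)
    q = np.exp(masked)
    return q / q.sum(axis=1, keepdims=True)

def m_step(Y, q):
    """Standard mixture updates, computed from the truncated posteriors q."""
    Nc = q.sum(axis=0) + 1e-12          # soft counts per class (guarded against 0)
    means = (q.T @ Y) / Nc[:, None]
    d2 = ((Y[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    var = (q * d2).sum() / (Y.shape[0] * Y.shape[1])
    log_prior = np.log(Nc / Nc.sum())
    return means, var, log_prior
```

Alternating `truncated_e_step` and `m_step` gives a basic truncated EM loop. The potential saving comes from later computations touching only `K` classes per data point instead of all of them.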
