IEEE Transactions on Pattern Analysis and Machine Intelligence

A scale space approach for automatically segmenting words from historical handwritten documents



Abstract

Many libraries, museums, and other organizations hold large collections of handwritten historical documents, for example, the papers of early presidents like George Washington at the Library of Congress. The first step in providing recognition/retrieval tools is to automatically segment handwritten pages into words. State-of-the-art segmentation techniques like the gap metrics algorithm have mostly been developed and tested on highly constrained documents like bank checks and postal addresses. There has been little work on full handwritten pages, and that work has usually involved testing on clean artificial documents created for research purposes. Historical manuscript images, on the other hand, contain a great deal of noise and are much more challenging. Here, a novel scale space algorithm for automatically segmenting handwritten (historical) documents into words is described. First, the page is cleaned to remove margins. This is followed by a gray-level projection profile algorithm for finding lines in images. Each line image is then filtered with an anisotropic Laplacian at several scales. This procedure produces blobs which correspond to portions of characters at small scales and to words at larger scales. Crucial to the algorithm is scale selection, that is, finding the optimum scale at which blobs correspond to words. This is done by finding the maximum over scale of the extent or area of the blobs. This scale maximum is estimated using three different approaches. The blobs recovered at the optimum scale are then bounded with a rectangular box to recover the words. A post-processing filtering step is performed to eliminate boxes of unusual size, which are unlikely to correspond to words. The approach is tested on a number of different data sets, and it is shown that, on 100 sampled documents from the George Washington corpus of handwritten document images, a total error rate of 17 percent is observed.
The technique outperforms a state-of-the-art gap metrics word-segmentation algorithm on this collection.
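The pipeline sketched in the abstract (projection-profile line finding, multiscale blob filtering with scale selection, and bounding-box extraction) can be illustrated in Python. This is a simplified sketch, not the authors' implementation: it substitutes an isotropic Laplacian of Gaussian (`scipy.ndimage.gaussian_laplace`) for the paper's anisotropic filter, and the threshold ratio, scale set, and `min_area` filter are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace, label

def find_text_lines(page, threshold_ratio=0.1):
    """Locate text lines with a gray-level horizontal projection profile.

    Rows whose summed ink exceeds a fraction of the peak row are treated
    as text; contiguous runs of such rows become line images.
    """
    ink = page.max() - page                 # dark pixels carry the ink
    profile = ink.sum(axis=1)
    mask = profile > threshold_ratio * profile.max()
    lines, start = [], None
    for r, on in enumerate(mask):
        if on and start is None:
            start = r
        elif not on and start is not None:
            lines.append(page[start:r])
            start = None
    if start is not None:
        lines.append(page[start:])
    return lines

def word_scale(line, scales=(2, 4, 6, 8, 12)):
    """Pick the scale maximizing total blob area of the LoG response.

    A crude surrogate for the paper's scale selection: blob interiors are
    taken as pixels with positive (sign-flipped) LoG response, and the
    scale with the largest total blob area wins.
    """
    ink = line.max() - line
    best_sigma, best_area, best_blobs = None, -1, None
    for sigma in scales:
        response = -gaussian_laplace(ink.astype(float), sigma)
        blobs = response > 0                # positive response marks blobs
        area = int(blobs.sum())
        if area > best_area:
            best_sigma, best_area, best_blobs = sigma, area, blobs
    return best_sigma, best_blobs

def word_boxes(blobs, min_area=30):
    """Bounding boxes of connected blobs; undersized boxes are filtered."""
    labeled, n = label(blobs)
    boxes = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labeled == i)
        if len(ys) >= min_area:
            boxes.append((ys.min(), xs.min(), ys.max(), xs.max()))
    return boxes
```

On a page image (dark ink on light background), `find_text_lines` yields line strips, `word_scale` picks a blob scale per line, and `word_boxes` returns candidate word rectangles; the size filter plays the role of the post-processing step described above.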
