2012 IEEE 8th International Conference on E-Science

A framework to access handwritten information within large digitized paper collections

Abstract

We describe our efforts with the National Archives and Records Administration (NARA) to provide a form of automated search of handwritten content within large digitized document archives. With a growing push towards the digitization of paper archives, there is a pressing need for tools capable of searching the resulting unstructured image data, because such collections offer valuable historical records that can be mined for information pertinent to fields ranging from the geosciences to the humanities. To carry out the search, we use a computer vision technique called word spotting. A form of content-based image retrieval, it avoids the still-difficult task of directly recognizing the text: the user searches with a query image containing handwritten text, and the images in a database are ranked by how similar their content looks to the query. To make this search capability available on an archive, three computationally expensive pre-processing steps are required. We describe these steps, the open source framework we have developed, and how it can be used not only on the recently released 1940 Census data, which contains nearly 4 million high-resolution scanned forms, but also on other collections of forms. With a growing demand to digitize our wealth of paper archives, we see this type of automated search as a low-cost, scalable alternative to the costly manual transcription that would otherwise be required.
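
The query-by-image ranking described in the abstract can be illustrated with a minimal sketch. The abstract does not say which features or distance measure the framework uses, so everything here is an assumption chosen for illustration: word images are tiny binary grids, the feature is a per-column ink count, and similarity is a dynamic time warping (DTW) distance. The names `column_profile`, `dtw_distance`, `rank_by_similarity`, `cell_a`, and `cell_b` are hypothetical and not taken from the authors' code.

```python
from math import inf


def column_profile(image):
    """Ink count per column of a binary image given as a list of rows (1 = ink)."""
    return [sum(column) for column in zip(*image)]


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def rank_by_similarity(query, database):
    """Return (name, distance) pairs, most similar to the query image first."""
    query_profile = column_profile(query)
    scored = [
        (name, dtw_distance(query_profile, column_profile(image)))
        for name, image in database.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Toy data (assumed): a 3x3 "H"-like query and two candidate word images.
query = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
]
database = {
    "cell_a": [  # a wider "H"-like shape
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [1, 0, 0, 1],
    ],
    "cell_b": [  # a "T"-like shape
        [1, 1, 1],
        [0, 1, 0],
        [0, 1, 0],
    ],
}

ranking = rank_by_similarity(query, database)
for name, distance in ranking:
    print(name, distance)
```

In this toy setup the wider "H" ranks first because DTW tolerates the horizontal stretch, which is one reason warping-based distances are a common choice for handwriting, where the same word varies in width from one writer to the next. No text is recognized at any point; the search only compares image-derived profiles.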