A framework to access handwritten information within large digitized paper collections


Abstract

We describe our efforts with the National Archives and Records Administration (NARA) to provide a form of automated search of handwritten content within large digitized document archives. With a growing push towards the digitization of paper archives, there is an imminent need to develop tools capable of searching the resulting unstructured image data, as such collections offer valuable historical records that can be mined for information pertinent to a number of fields, from the geosciences to the humanities. To carry out the search, we use a computer vision technique called word spotting. A form of content-based image retrieval, it avoids the still-difficult task of directly recognizing the text: a user searches with a query image containing handwritten text, and the database of images is ranked by how similar their content looks to the query. To make this search capability available on an archive, three computationally expensive pre-processing steps are required. We describe these steps, the open-source framework we have developed, and how it can be used not only on the recently released 1940 Census data, containing nearly 4 million high-resolution scanned forms, but also on other collections of forms. With a growing demand to digitize our wealth of paper archives, we see this type of automated search as a low-cost, scalable alternative to the costly manual transcription that would otherwise be required.
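
The query-by-image ranking described in the abstract can be illustrated with a minimal sketch. The abstract does not say which image features or matching method the framework uses, so everything below is an assumption made for illustration only: word images are tiny binary grids, the feature is a per-column ink profile, and profiles are compared with dynamic time warping. The names `column_profile`, `dtw_distance`, and `rank_by_similarity` are hypothetical and are not taken from the framework.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption: a word image is a grid of 0/1 pixels, where 1 means ink.
Image = Sequence[Sequence[int]]


def column_profile(image: Image) -> List[float]:
    """Fraction of ink pixels in each column (a simple illustrative feature)."""
    height = len(image)
    width = len(image[0])
    return [sum(row[x] for row in image) / height for x in range(width)]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping cost between two 1-D profiles.

    Warping lets a narrow and a wide rendering of the same word align.
    """
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row 0 of the cost table
    for x in a:
        curr = [inf]  # column 0 is unreachable after the first row
        for j, y in enumerate(b, start=1):
            best_previous = min(prev[j], prev[j - 1], curr[j - 1])
            curr.append(abs(x - y) + best_previous)
        prev = curr
    return prev[-1]


def rank_by_similarity(
    query: Image, database: Dict[str, Image]
) -> List[Tuple[str, float]]:
    """Return (name, distance) pairs, most similar (smallest distance) first."""
    query_profile = column_profile(query)
    scored = [
        (name, dtw_distance(query_profile, column_profile(image)))
        for name, image in database.items()
    ]
    return sorted(scored, key=lambda item: item[1])


# Toy data: two strokes separated by a gap, a wider copy, and a blank cell.
query = [[1, 0, 1], [1, 0, 1]]
database = {
    "same": [[1, 0, 1], [1, 0, 1]],
    "stretched": [[1, 1, 0, 0, 1, 1], [1, 1, 0, 0, 1, 1]],
    "blank": [[0, 0, 0], [0, 0, 0]],
}
ranking = rank_by_similarity(query, database)
for name, distance in ranking:
    print(name, distance)
```

In this toy setting the ranking is recomputed for every query. For a collection of millions of forms, the per-image features would more plausibly be computed once ahead of time, which is consistent with the abstract's mention of expensive pre-processing, although the abstract does not name the three steps.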

Bibliographic record

Source
2012 IEEE 8th International Conference on E-Science | 2012 | pp. 1-10 | 10 pages
Conference location: Chicago, IL (US)
Authors
Diesendruck Liana; Marini Luigi; Kooper Rob; Kejriwal Mayank; McHenry Kenton
Author affiliation

National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign;

Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology

