A Synthetic Document Image Dataset for Developing and Evaluating Historical Document Processing Methods

Abstract

Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Additionally, research into improving the performance of such methods often requires further annotation of training and test data (e.g., topical document labels). However, transcribing and labeling historical documents is expensive. As a result, existing real-world document image datasets with such accompanying resources are rare and often relatively small. We introduce synthetic document image datasets of varying levels of noise that have been created from standard (English) text corpora using an existing document degradation model applied in a novel way. Included in the datasets is the OCR output from real OCR engines including the commercial ABBYY FineReader and the open-source Tesseract engines. These synthetic datasets are designed to exhibit some of the characteristics of an example real-world document image dataset, the Eisenhower Communiques. The new datasets also benefit from additional metadata that exist due to the nature of their collection and prior labeling efforts. We demonstrate the usefulness of the synthetic datasets by training an existing multi-engine OCR correction method on the synthetic data and then applying the model to reduce word error rates on the historical document dataset. The synthetic datasets will be made available for use by other researchers.
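The abstract evaluates OCR correction by its effect on word error rate (WER) but does not define the metric. The sketch below assumes the common definition, word-level edit distance between the ground-truth transcription and the OCR output divided by the number of reference words, together with plain whitespace tokenization. The function name and the tokenization are illustrative assumptions, not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are whitespace-separated; a real evaluation may
    normalize case and punctuation before comparing.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from OCR output
            insertion = curr[j - 1] + 1  # extra word in OCR output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr_output = "the qu1ck brown"
    print(word_error_rate(truth, ocr_output))
```

Under this definition a lower value is better, and the value can exceed 1.0 when the OCR output contains many inserted words.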

Bibliographic details

Source
Document Recognition and Retrieval XIX | 2012 | pp. 829710.1-829710.8 | 8 pages
Conference location: Burlingame, CA (US)
Authors
Daniel Walker; William Lund; Eric Ringger
Author affiliations

Natural Language Processing Lab, Computer Science Dept., Brigham Young University, Provo, UT, USA (all three authors)

Original format: PDF
Text language: English
Chinese Library Classification: Information processing
Keywords
synthetic document images; OCR; datasets; document degradation models; historical document processing
