Published in: Document Recognition and Retrieval XIX

A Synthetic Document Image Dataset for Developing and Evaluating Historical Document Processing Methods

Abstract

Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Additionally, research into improving the performance of such methods often requires further annotation of training and test data (e.g., topical document labels). However, transcribing and labeling historical documents is expensive. As a result, existing real-world document image datasets with such accompanying resources are rare and often relatively small. We introduce synthetic document image datasets of varying levels of noise that have been created from standard (English) text corpora using an existing document degradation model applied in a novel way. Included in the datasets is the OCR output from real OCR engines including the commercial ABBYY FineReader and the open-source Tesseract engines. These synthetic datasets are designed to exhibit some of the characteristics of an example real-world document image dataset, the Eisenhower Communiques. The new datasets also benefit from additional metadata that exist due to the nature of their collection and prior labeling efforts. We demonstrate the usefulness of the synthetic datasets by training an existing multi-engine OCR correction method on the synthetic data and then applying the model to reduce word error rates on the historical document dataset. The synthetic datasets will be made available for use by other researchers.
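The abstract reports results as word error rates but does not define the metric. As a minimal sketch, assuming the conventional definition (word-level edit distance between the OCR output and the ground truth transcription, divided by the number of ground truth words), it could be computed as below. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not details taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; the paper's own
    tokenization and alignment procedure may differ.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr_output = "the qu1ck brown"
    print(word_error_rate(truth, ocr_output))
```

Under this definition, a transcription with one substituted word and one missing word out of four ground truth words has a word error rate of 0.5. The value can exceed 1.0 when the OCR output contains many spurious insertions.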