Conference: International Conference on Document Analysis and Recognition

The ENP image and ground truth dataset of historical newspapers

Abstract

This paper presents a research dataset of historical newspapers comprising over 500 page images, uniquely representative of European cultural heritage, drawn from the digitisation projects of 12 national and major European libraries and created within the scope of the Europeana Newspapers Project (ENP), a large-scale digitisation project. Every image is accompanied by comprehensive ground truth (Unicode-encoded full text, layout information with precise region outlines, type labels, and reading order) in PAGE format, together with searchable metadata about document characteristics and artefacts. The first part of the paper describes the nature of the dataset, how it was built, and the challenges encountered. The second part gives a baseline for two state-of-the-art OCR systems (ABBYY FineReader Engine 11 and Tesseract 3.03) with regard to both text recognition and segmentation/layout analysis performance.
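
As an illustration of the kind of ground truth described above (region outlines, type labels, Unicode text, and reading order), the following is a minimal sketch of reading a PAGE XML file with Python's standard library. The embedded sample is hand-written for illustration and is not taken from the ENP dataset; the schema namespace shown is an assumption about which PAGE version a given file uses, so the parser matches element names without relying on the namespace. The function name `read_regions` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the style of PAGE XML.
# It is NOT a file from the ENP dataset; the namespace version is an assumption.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.tif" imageWidth="2000" imageHeight="3000">
    <ReadingOrder>
      <OrderedGroup id="g0">
        <RegionRefIndexed index="0" regionRef="r1"/>
        <RegionRefIndexed index="1" regionRef="r2"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r2" type="paragraph">
      <Coords points="100,300 900,300 900,800 100,800"/>
      <TextEquiv><Unicode>Body text of the article.</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r1" type="heading">
      <Coords points="100,100 900,100 900,250 100,250"/>
      <TextEquiv><Unicode>Headline</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_regions(xml_text):
    """Return text regions as dicts, sorted by reading order when available."""
    root = ET.fromstring(xml_text)
    order = {}    # region id -> position in the reading order
    regions = []
    for elem in root.iter():
        name = local(elem.tag)
        if name == "RegionRefIndexed":
            order[elem.get("regionRef")] = int(elem.get("index"))
        elif name == "TextRegion":
            points, text = [], ""
            # Only direct children are inspected, so line-level or
            # word-level Coords/TextEquiv elements are ignored.
            for child in elem:
                child_name = local(child.tag)
                if child_name == "Coords":
                    points = [
                        tuple(int(v) for v in pair.split(","))
                        for pair in child.get("points", "").split()
                    ]
                elif child_name == "TextEquiv":
                    for sub in child:
                        if local(sub.tag) == "Unicode":
                            text = sub.text or ""
            regions.append({
                "id": elem.get("id"),
                "type": elem.get("type"),
                "points": points,
                "text": text,
            })
    # Regions missing from the reading order are placed at the end.
    regions.sort(key=lambda r: order.get(r["id"], len(order)))
    return regions


if __name__ == "__main__":
    for region in read_regions(SAMPLE):
        print(region["type"], len(region["points"]), repr(region["text"]))
```

The sketch covers text regions only. Other region types, nested groups in the reading order, and older PAGE versions that store outlines as individual point elements would need additional handling.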
