The ENP image and ground truth dataset of historical newspapers



Abstract

This paper presents a research dataset of historical newspapers comprising over 500 page images, uniquely representative of European cultural heritage from the digitization projects of 12 national and major European libraries, created within the scope of the large-scale digitisation Europeana Newspapers Project (ENP). Every image is accompanied by comprehensive ground truth (Unicode encoded full-text, layout information with precise region outlines, type labels, and reading order) in PAGE format and searchable metadata about document characteristics and artefacts. The first part of the paper describes the nature of the dataset, how it was built, and the challenges encountered. In the second part, a baseline for two state-of-the-art OCR systems (ABBYY FineReader Engine 11 and Tesseract 3.03) is given with regard to both text recognition and segmentation/layout analysis performance.
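The ground truth described above (region outlines, type labels, reading order, Unicode text) can be illustrated with a minimal sketch that reads a PAGE-style XML file. The sample fragment, the region ids, and the helper names `local_name` and `read_regions` are hypothetical and not taken from the dataset. The sketch also assumes a schema version in which outlines are stored in a `points` attribute on `Coords`; other PAGE versions may encode coordinates differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PAGE-style fragment (not from the ENP dataset).
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.tif" imageWidth="2000" imageHeight="3000">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r2"/>
        <RegionRefIndexed index="1" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1" type="paragraph">
      <Coords points="100,400 900,400 900,800 100,800"/>
      <TextEquiv><Unicode>Body text of the article.</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2" type="heading">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextEquiv><Unicode>Headline</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code does not depend on one schema version."""
    return tag.rsplit("}", 1)[-1]


def read_regions(xml_text):
    """Return text regions (id, type, outline points, text) sorted by reading order."""
    root = ET.fromstring(xml_text)
    regions = {}
    order = {}
    for el in root.iter():
        name = local_name(el.tag)
        if name == "TextRegion":
            points, text = [], ""
            # Only look at direct children, so nested line-level text is ignored.
            for child in el:
                child_name = local_name(child.tag)
                if child_name == "Coords":
                    points = [
                        tuple(int(v) for v in pair.split(","))
                        for pair in child.get("points", "").split()
                    ]
                elif child_name == "TextEquiv":
                    for sub in child:
                        if local_name(sub.tag) == "Unicode":
                            text = sub.text or ""
            regions[el.get("id")] = {
                "type": el.get("type"),
                "points": points,
                "text": text,
            }
        elif name == "RegionRefIndexed":
            order[el.get("regionRef")] = int(el.get("index"))
    # Regions without a reading-order entry are placed after the ordered ones.
    ordered_ids = sorted(regions, key=lambda rid: order.get(rid, len(order)))
    return [dict(id=rid, **regions[rid]) for rid in ordered_ids]


if __name__ == "__main__":
    for region in read_regions(SAMPLE):
        print(region["id"], region["type"], region["points"][0], region["text"])
```

Matching on local element names rather than a fixed namespace is a design choice made here so that the sketch tolerates different schema dates; a production reader would normally validate against the specific schema the files declare.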


Bibliographic details

Source
International Conference on Document Analysis and Recognition | 2015 | pp. 931-935 | 5 pages
Authors
Clausner Christian; Papadopoulos Christos; Pletschacher Stefan; Antonacopoulos Apostolos
Original format: PDF
Keywords
document image processing; history; image segmentation; libraries; meta data; optical character recognition; publishing; text detection; ABBYY FineReader Engine 11; ENP image; European cultural heritage; Europeana Newspapers Project digitization; OCR systems; PAGE format; Tesseract 3.03; document characteristics; historical newspaper ground truth dataset; major European libraries; national European libraries; searchable metadata; segmentation/layout analysis performance; text recognition; Engines; Europe; Optical character recognition software; document analysis; ground truth; historical documents; image dataset;


Similar literature


1. Ground Truth OCR Sample Data of Finnish Historical Newspapers and Journals in Data Improvement Validation of a re-OCRing Process [J]. Kimmo Kettunen, Mika Koistinen, Jukka Kervinen. LIBER Quarterly - Journal of European Research Libraries. 2020, No. 1

2. PFuji-Size dataset: A collection of images and photogrammetry-derived 3D point clouds with ground truth annotations for Fuji apple detection and size estimation in field conditions [J]. Jordi Gené-Mola, Ricardo Sanz-Cortiella, Joan R. Rosell-Polo. Data in Brief. 2021, No. a

3. Corrigendum to “A large dataset of synthetic SEM images of powder materials and their ground truth 3D structures” [Data Brief 9 (2016) 727–731] [J]. Brian L. DeCost, Elizabeth A. Holm. Data in Brief. 2018, No. 1

4. The ENP Image and Ground Truth Dataset of Historical Newspapers [C]. Christian Clausner, Christos Papadopoulos, Stefan Pletschacher. International Conference on Document Analysis and Recognition. 2015

5. Recovery of ground-truth pixel information from airborne hyperspectral images [D]. Muktinutalapati, Kartik Chandra. 2007

6. Corrigendum to A large dataset of synthetic SEM images of powder materials and their ground truth 3D structures Data Brief 9 (2016) 727–731 [O]. Brian L. DeCost, Elizabeth A. Holm. 2018

7. Ground Truth OCR Sample Data of Finnish Historical Newspapers and Journals in Data Improvement Validation of a re-OCRing Process [O]. Kimmo Kettunen, Mika Koistinen, Jukka Kervinen. 2020
