PubLayNet: Largest Dataset Ever for Document Layout Analysis

Abstract

Recognizing the layout of unstructured digital documents is an important step when parsing the documents into a structured, machine-readable format for downstream applications. Deep neural networks developed for computer vision have proven effective at analyzing the layout of document images. However, the document layout datasets that are currently publicly available are several orders of magnitude smaller than established computer vision datasets. As a result, models have to be trained by transfer learning from a base model that is pre-trained on a traditional computer vision dataset. In this paper, we develop the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The dataset is comparable in size to established computer vision datasets, containing over 360 thousand document images in which typical document layout elements are annotated. The experiments demonstrate that deep neural networks trained on PubLayNet accurately recognize the layout of scientific articles. The pre-trained models are also a more effective base model for transfer learning on a different document domain. We release the dataset (https://github.com/ibm-aur-nlp/PubLayNet) to support development and evaluation of more advanced models for document layout analysis.
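
The abstract does not spell out how the XML representation is matched against the PDF content. As a rough illustration only, the idea can be sketched with fuzzy string matching between XML nodes and text boxes extracted from a PDF page. Everything in the sketch is an assumption made for illustration: the simplified XML, the `PDF_BOXES` input, the `TAG_TO_LABEL` mapping, the `annotate` function, and the 0.8 threshold are hypothetical and are not taken from the paper or from the PubLayNet code.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified article XML (real PubMed Central XML is richer).
ARTICLE_XML = """
<article>
  <title>Deep learning for layout analysis</title>
  <p>Recognizing the layout of documents is an important step.</p>
  <table>Model Precision Recall</table>
</article>
"""

# Hypothetical text boxes from one PDF page: (text, (x0, y0, x1, y1)).
PDF_BOXES = [
    ("Deep learning for layout analysis", (50, 40, 400, 60)),
    ("Recognizing the layout of docu- ments is an important step.", (50, 80, 300, 120)),
    ("Model Precision Recall", (50, 140, 300, 200)),
    ("Page 3 of 8", (250, 780, 300, 790)),  # page footer, absent from the XML
]

# Assumed mapping from XML tags to layout categories.
TAG_TO_LABEL = {"title": "title", "p": "text", "table": "table"}


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def annotate(xml_string, pdf_boxes, threshold=0.8):
    """Label each PDF box with the category of its best-matching XML node.

    Boxes whose best match scores below the threshold are left unannotated.
    """
    root = ET.fromstring(xml_string)
    nodes = [
        (TAG_TO_LABEL[el.tag], (el.text or "").strip())
        for el in root
        if el.tag in TAG_TO_LABEL
    ]
    annotations = []
    for text, bbox in pdf_boxes:
        label, score = max(
            ((lbl, similarity(text, node_text)) for lbl, node_text in nodes),
            key=lambda pair: pair[1],
        )
        if score >= threshold:
            annotations.append({"bbox": bbox, "category": label})
    return annotations


if __name__ == "__main__":
    for annotation in annotate(ARTICLE_XML, PDF_BOXES):
        print(annotation)
```

Fuzzy rather than exact matching is used here because text extracted from a PDF often differs slightly from the XML source, for example through line-break hyphenation. The threshold keeps content that has no XML counterpart, such as the page footer, out of the annotations.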

Bibliographic record

Source: International Conference on Document Analysis and Recognition, 2019, pp. 1015-1022 (8 pages)
Authors: Xu Zhong; Jianbin Tang; Antonio Jimeno Yepes
Original format: PDF
Keywords: Layout; Retina; XML; Proteins; Veins; Text analysis; Australia
Date added to database: 2022-08-26 14:34:51

Similar literature

1. An integration of gauge, satellite, and reanalysis precipitation datasets for the largest river basin of the Tibetan Plateau [J]. Yuanwei Wang, Lei Wang, Xiuping Li. Earth System Science Data, 2020, No. 3.
2. Comparative Study of Layout Analysis of Tabulated Historical Documents [J]. Liang Xusheng, Cheddad Abbas, Hall Johan. Big Data Research, 2021, No. 1.
3. BINYAS: a complex document layout analysis system [J]. Bhowmik Showmik, Kundu Soumyadeep, Sarkar Ram. Multimedia Tools and Applications, 2021, No. 6.
4. PubLayNet: Largest Dataset Ever for Document Layout Analysis [C]. Xu Zhong, Jianbin Tang, Antonio Jimeno Yepes. International Conference on Document Analysis and Recognition, 2019.
5. Supernova Classification and Supernova Astrophysics: Spectral Analysis of the Largest Datasets of Stripped-Envelope Supernovae in the World [D]. Liu, Yuqian. 2017.
6. MOWDOC: A Dataset of Documents From Taking the Measure of Work for Building a Latent Semantic Analysis Space [O]. Kim F. Nimon. 2020.
7. PubLayNet: Largest Dataset Ever for Document Layout Analysis [O]. Xu Zhong, Jianbin Tang, Antonio Jimeno Yepes. 2019.
