International Conference on Document Analysis and Recognition

PubLayNet: Largest Dataset Ever for Document Layout Analysis




Recognizing the layout of unstructured digital documents is an important step in parsing them into a structured, machine-readable format for downstream applications. Deep neural networks developed for computer vision have proven effective at analyzing the layout of document images. However, publicly available document layout datasets are several orders of magnitude smaller than established computer vision datasets, so models have to be trained by transfer learning from a base model pre-trained on a traditional computer vision dataset. In this paper, we develop the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The dataset is comparable in size to established computer vision datasets: it contains over 360 thousand document images in which typical document layout elements are annotated. Experiments demonstrate that deep neural networks trained on PubLayNet accurately recognize the layout of scientific articles. The pre-trained models are also a more effective base model for transfer learning on a different document domain. We release the dataset (https://github.com/ibm-aur-nlp/PubLayNet) to support the development and evaluation of more advanced models for document layout analysis.
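The abstract does not spell out how the XML representation is matched to the PDF content, so the following is only a minimal sketch of the general idea, assuming plain fuzzy string matching between XML node text and text blocks already extracted from a PDF page. The names `PdfBlock`, `label_blocks`, `tag_to_label`, the sample XML tags, and the 0.8 threshold are all hypothetical choices made for illustration and are not taken from the paper.

```python
from difflib import SequenceMatcher
from typing import NamedTuple
import xml.etree.ElementTree as ET


class PdfBlock(NamedTuple):
    """A text block extracted from a PDF page (hypothetical structure)."""
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


def normalize(text):
    """Lowercase and collapse whitespace so layout differences matter less."""
    return " ".join(text.lower().split())


def label_blocks(xml_string, blocks, tag_to_label, threshold=0.8):
    """Give each PDF block the label of its most similar XML node.

    Blocks whose best similarity falls below `threshold` are left
    unannotated instead of being given a doubtful label.
    """
    root = ET.fromstring(xml_string)
    nodes = [
        (tag_to_label[el.tag], normalize("".join(el.itertext())))
        for el in root.iter()
        if el.tag in tag_to_label
    ]
    annotations = []
    for block in blocks:
        target = normalize(block.text)
        best_label, best_score = None, 0.0
        for label, node_text in nodes:
            score = SequenceMatcher(None, target, node_text).ratio()
            if score > best_score:
                best_label, best_score = label, score
        if best_score >= threshold:
            annotations.append(
                {"label": best_label, "bbox": block.bbox, "score": best_score}
            )
    return annotations


# Toy input: the tags and text are made up for this example.
xml_string = (
    "<article>"
    "<title>Deep Layout Analysis</title>"
    "<p>Recognizing the layout of documents is important.</p>"
    "</article>"
)
blocks = [
    PdfBlock("Deep  Layout Analysis", (50, 40, 400, 60)),
    PdfBlock("Recognizing the layout of docu-ments is important.", (50, 80, 400, 120)),
    PdfBlock("Page 3 of 10", (250, 780, 350, 790)),  # no XML counterpart
]
annotations = label_blocks(xml_string, blocks, {"title": "title", "p": "text"})
for ann in annotations:
    print(ann["label"], ann["bbox"])
```

In this toy input the page-number block has no counterpart in the XML, so it stays unannotated. Dropping weak matches rather than forcing a label is one plausible way to keep automatically generated annotations clean.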




