Data-Driven Recognition and Extraction of PDF Document Elements

Matthias Hansen; André Pomp; Kemal Erki; Tobias Meisen

Abstract

In the age of digitalization, the collection and analysis of large amounts of data are becoming increasingly important for enterprises that want to improve their businesses and processes, for example by introducing new services or realizing resource-efficient production. Enterprises concentrate strongly on the integration, analysis and processing of their data. Unfortunately, the majority of data analysis focuses on structured and semi-structured data, although unstructured data such as text documents or images account for the largest share of all available enterprise data. One reason for this is that most of this data is not machine-readable and requires dedicated analysis methods, such as natural language processing for analyzing textual documents or object recognition for recognizing objects in images. Especially in the latter case, the analysis methods depend strongly on the application. However, there are also data formats, such as PDF documents, which are not machine-readable and consist of many different document elements such as tables, figures or text sections. Although the analysis of PDF documents is a major challenge, they are used in all enterprises and contain various information that may contribute to analysis use cases. To enable their efficient retrieval and analysis, it is necessary to identify the different types of document elements so that each can be processed with a tailor-made approach. In this paper, we propose a system that forms the basis for structuring unstructured PDF documents, so that the identified document elements can subsequently be retrieved and analyzed. Due to the high diversity of possible document elements and analysis methods, this paper focuses on the automatic identification and extraction of data visualizations, algorithms, other diagram-like objects and tables from a mixed document body. For that, we present two different approaches. The first approach combines methods from deep learning and rule-based image processing, whereas the second approach is based purely on deep learning. To train our neural networks, we manually annotated a large corpus of PDF documents with our own annotation tool, both of which are published together with this paper. The results of our extraction pipeline show that we are able to automatically extract graphical items with a precision of 0.73 and a recall of 0.8. For tables, we reach a precision of 0.78 and a recall of 0.94.
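For readers who want to relate the reported figures to each other, the following minimal Python sketch shows how precision and recall are computed from detection counts and how the two reported pairs combine into an F1 score. This is an illustration only: the abstract does not report F1, the function names are hypothetical, and the code is not taken from the authors' pipeline.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive
    and false-negative counts. The counts are supplied by the caller;
    the abstract itself gives only the resulting ratios."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as stated in the abstract.
REPORTED = {
    "graphical items": (0.73, 0.80),
    "tables": (0.78, 0.94),
}

if __name__ == "__main__":
    for name, (p, r) in REPORTED.items():
        # F1 here is derived arithmetic, not a figure from the paper.
        print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1_score(p, r):.2f}")
```

The higher recall than precision in both cases means the pipeline misses few elements but returns some false detections, which the derived F1 values summarize in a single number.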


Bibliographic details

Source: Technologies, 2019, Issue 3, 19 pages
Original format: PDF
Classification (CLC): General industrial technology
Keywords: PDF extraction; machine learning; data corpus; data processing; unstructured data
