Journal: Technologies

Data-Driven Recognition and Extraction of PDF Document Elements


Abstract

In the age of digitalization, collecting and analyzing large amounts of data is becoming increasingly important for enterprises that want to improve their businesses and processes, for example by introducing new services or realizing resource-efficient production. Enterprises therefore concentrate strongly on integrating, analyzing and processing their data. Unfortunately, most data analysis focuses on structured and semi-structured data, although unstructured data such as text documents or images accounts for the largest share of all available enterprise data. One reason is that most of this data is not machine-readable and requires dedicated analysis methods, such as natural language processing for textual documents or object recognition for images. Especially in the latter case, the analysis methods depend strongly on the application. However, there are also data formats, such as PDF documents, that are not machine-readable and consist of many different document elements, such as tables, figures or text sections. Although analyzing PDF documents is a major challenge, they are used in all enterprises and contain various information that may contribute to analysis use cases. To enable their efficient retrieval and analysis, the different types of document elements must be identified so that each can be processed with a tailor-made approach. In this paper, we propose a system that forms the basis for structuring unstructured PDF documents, so that the identified document elements can subsequently be retrieved and analyzed. Because of the high diversity of possible document elements and analysis methods, this paper focuses on the automatic identification and extraction of data visualizations, algorithms, other diagram-like objects and tables from a mixed document body. To this end, we present two different approaches. The first combines deep learning with rule-based image processing, whereas the second is based purely on deep learning. To train our neural networks, we manually annotated a large corpus of PDF documents with our own annotation tool; both the corpus and the tool are published together with this paper. The results of our extraction pipeline show that we can automatically extract graphical items with a precision of 0.73 and a recall of 0.8. For tables, we reach a precision of 0.78 and a recall of 0.94.
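As a reading aid for the reported figures, the following minimal sketch shows how precision and recall are conventionally computed from detection counts, and how the two reported pairs combine into an F1 score. The abstract does not report F1 or any raw counts; the function names and the example counts below are hypothetical and serve only to illustrate the standard definitions.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (not a figure reported in the abstract)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only so the result lands near the reported
# graphical-item figures; they are NOT taken from the paper.
p, r = precision_recall(tp=80, fp=30, fn=20)
print(f"hypothetical counts: precision={p:.2f}, recall={r:.2f}")

# F1 derived from the precision/recall pairs stated in the abstract.
print(f"graphical items F1: {f1_score(0.73, 0.80):.2f}")
print(f"tables F1:          {f1_score(0.78, 0.94):.2f}")
```

Read this way, the high table recall of 0.94 against a precision of 0.78 indicates that few tables are missed, while about one in five extracted table candidates is a false positive.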
