Conference: 2017 ACM/IEEE Joint Conference on Digital Libraries

A Benchmark and Evaluation for Text Extraction from PDF

Abstract

Extracting the body text from a PDF document is an important but surprisingly difficult task. The reason is that PDF is a layout-based format which specifies the fonts and positions of the individual characters rather than the semantic units of the text (e.g., words or paragraphs) and their role in the document (e.g., body text or caption). There is an abundance of extraction tools, but their quality and the range of their functionality are hard to determine. In this paper, we show how to construct a high-quality benchmark of, in principle, arbitrary size from parallel TeX and PDF data. We construct such a benchmark of 12,098 scientific articles from arXiv.org and make it publicly available. We establish a set of criteria for a clean and independent assessment of the semantic abilities of a given extraction tool. We provide an extensive evaluation of 14 state-of-the-art tools for text extraction from PDF on our benchmark according to our criteria. We include our own method, Icecite, which significantly outperforms all other tools, but is still not perfect. We outline the remaining steps necessary to finally make text extraction from PDF a "solved problem".
