Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2014)

Improved Text Extraction from PDF Documents for Large-Scale Natural Language Processing*



Abstract

The lack of reliable text extraction from arbitrary documents is often an obstacle for large-scale NLP based on resources crawled from the Web. One of the largest problems in the conversion of PDF documents is the detection of the boundaries of common textual units such as paragraphs, sentences and words. PDF is a file format optimized for printing that encapsulates a complete description of the layout of a document, including text, fonts, graphics and so on. This paper describes a tool for extracting text from arbitrary PDF files in support of large-scale data-driven natural language processing. Our approach combines the benefits of several existing solutions for the conversion of PDF documents to plain text and adds a language-independent post-processing procedure that cleans the output for further linguistic processing. In particular, we use the PDF-rendering libraries pdfXtk, Apache Tika and Poppler in various configurations. From the output of these tools we recover proper boundaries using on-the-fly language models and language-independent extraction heuristics. In our research, we looked especially at publications from the European Union, which constitute a valuable multilingual resource, for example, for training statistical machine translation models. We use our tool for the conversion of a large multilingual database crawled from the EU bookshop with the aim of building parallel corpora. Our experiments show that our conversion software is capable of fixing various common issues, leading to cleaner data sets in the end.
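The language-independent post-processing the abstract describes can be illustrated with a minimal sketch. This is a hypothetical simplification, not the authors' actual tool: it shows only two of the heuristics such a cleanup step typically applies to raw PDF-extractor output, namely re-joining words hyphenated across line breaks and restoring paragraph boundaries.

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Clean raw text from a PDF extractor (e.g. Poppler's pdftotext).

    Two language-independent heuristics, as an illustration:
      1. Re-join words split with a hyphen at the end of a line.
      2. Treat blank lines as paragraph boundaries and fold the
         remaining hard line breaks into single spaces.
    """
    # 1. "extrac-\ntion" -> "extraction"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # 2. Split on blank lines, then normalize whitespace per paragraph.
    paragraphs = (" ".join(block.split())
                  for block in re.split(r"\n\s*\n", text))
    return "\n\n".join(p for p in paragraphs if p)

raw = "Text extrac-\ntion from PDF\ndocuments.\n\nSecond paragraph."
print(clean_extracted_text(raw))
# -> "Text extraction from PDF documents.\n\nSecond paragraph."
```

A real pipeline would additionally consult a language model to decide whether a line-final hyphen is a true compound hyphen or a typesetting artifact, which is where the paper's on-the-fly language models come in.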
