Intelligent Text Extraction from PDF Documents

Abstract

In recent years, PDF has become the de facto standard for the exchange of print-oriented documents on the Web. This includes many business documents such as financial reports, newsletters and patent applications, and many commercial applications require data to be extracted from these documents and processed by computer systems. A number of products currently on the market navigate, extract and transform data from HTML pages, a process known as wrapping. One such methodology is Lixto, a product of research at our institute. However, none of these products can currently work with PDF files. We are investigating this possibility as part of the NEXTWRAP project. This paper describes our work in progress and details some of the low-level page segmentation techniques that we have investigated.