Home > Foreign-language conferences > VLSI Multilevel Interconnection Conference, 1990. > Towards a canonical and structured representation of PDF documents through reverse engineering
Towards a canonical and structured representation of PDF documents through reverse engineering

Abstract

This article presents Xed, a reverse engineering tool for PDF documents that extracts the original document layout structure. Xed combines electronic extraction methods with state-of-the-art document analysis techniques and outputs the layout structure in a hierarchical canonical form, that is, a form that is universal and independent of the document type. The article first reviews the major traps and tricks of the PDF format. It then introduces the architecture of Xed and its main modules, in particular the algorithm that extracts the document's physical structure. Next, a canonical format is proposed and discussed with an example. Finally, the results of a practical evaluation are presented, followed by an outline of future work on logical structure extraction.