Conference: ACM/IEEE Joint Conference on Digital Libraries

Structure Extraction from PDF-based Book Documents

Abstract

PDF documents have become a dominant knowledge repository for both academia and industry, largely because they are convenient to print and exchange. However, methods for automatically extracting structure information have yet to be fully explored, and the lack of effective methods hinders the reuse of information in PDF documents. To enhance the usability of PDF-formatted electronic books, we propose a novel computational framework for analyzing their underlying physical and logical structure. The analysis is conducted at both the page level and the document level, and covers global typography, reading order, logical elements, chapter/section hierarchy, and metadata. Moreover, two characteristics of PDF-based books, namely style consistency across the whole book and the natural rendering order of PDF files, are exploited to improve on conventional image-based structure extraction methods. The paper uses the bipartite graph as a common structure for modeling several tasks, including reading order recovery, figure and caption association, and metadata extraction. Based on this graph representation, the optimal matching (OM) method is used to find the global optimum in each of those tasks. Extensive benchmarking on real-world data validates the high efficiency and discrimination ability of the proposed method.
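
To make the bipartite-graph formulation concrete, the following is a minimal sketch of figure and caption association posed as a minimum-cost one-to-one matching. It is not the paper's implementation: the `Box` type, the `cost` function (horizontal centre offset plus vertical gap), and the sample coordinates are all hypothetical, and the optimum is found by exhaustive search over permutations, which is only practical for the handful of figures on a single page. A polynomial-time solver such as the Hungarian algorithm would presumably be used at scale.

```python
from itertools import permutations
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box on a page (hypothetical layout type)."""
    name: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def cost(figure: Box, caption: Box) -> float:
    """Assumed edge weight: horizontal centre offset plus vertical gap
    between the figure's bottom edge and the caption's top edge."""
    dx = abs((figure.x0 + figure.x1) / 2 - (caption.x0 + caption.x1) / 2)
    dy = abs(caption.y0 - figure.y1)
    return dx + dy


def optimal_matching(
    figures: List[Box], captions: List[Box]
) -> Tuple[List[Tuple[Box, Box]], float]:
    """Return the one-to-one pairing with minimum total cost.

    Exhaustive search over all assignments; assumes
    len(figures) <= len(captions) and small inputs.
    """
    best_pairs: List[Tuple[Box, Box]] = []
    best_total = float("inf")
    for perm in permutations(captions, len(figures)):
        total = sum(cost(f, c) for f, c in zip(figures, perm))
        if total < best_total:
            best_total = total
            best_pairs = list(zip(figures, perm))
    return best_pairs, best_total


# Two side-by-side figures; captions are listed in the "wrong" order.
figures = [Box("fig_a", 50, 100, 250, 300), Box("fig_b", 300, 100, 500, 300)]
captions = [Box("cap_1", 310, 310, 490, 330), Box("cap_2", 60, 310, 240, 330)]

pairs, total = optimal_matching(figures, captions)
for f, c in pairs:
    print(f.name, "->", c.name)
```

Because the objective is the total cost over all pairs, the result is a global optimum for the page rather than a sequence of locally nearest choices. The same structure could model reading order recovery or metadata extraction by swapping in different node sets and a different cost function.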