Conference: Document Recognition and Retrieval XXII

Cross-reference identification within a PDF document

Abstract

Cross-references, such as footnotes, endnotes, figure/table captions, and references, are a common and useful type of page element that further explains the corresponding entities in a document. In this paper, we focus on cross-reference identification in a PDF document and present a robust method, using the identification of footnotes and figure references as a case study. The proposed method first extracts footnotes and figure captions, and then matches them with their corresponding references within the document. A number of novel features of a PDF document, namely page layout, font information, and lexical and linguistic features of cross-references, are utilized for the task. Clustering is adopted to handle features that are stable within one document but vary across different kinds of documents, so that the identification process adapts to the document type. In addition, the method uses results from the matching process as feedback to the identification process, further improving the accuracy of the algorithm. Preliminary experiments on real document sets show that the proposed method is promising for identifying cross-references in a PDF document.
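
The extract-then-match idea in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the abstract mentions layout, font, and linguistic features plus clustering and feedback, none of which are specified here. The sketch assumes the PDF text has already been extracted into a list of lines and relies only on simple lexical patterns. The names `CAPTION_RE`, `REFERENCE_RE`, and `match_figure_references` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed lexical patterns (not taken from the paper):
# a caption line starts with "Figure N." / "Fig. N:"; a reference is any other mention.
CAPTION_RE = re.compile(r"^(?:Figure|Fig\.)\s*(\d+)\s*[.:]")
REFERENCE_RE = re.compile(r"\b(?:Figure|Fig\.)\s*(\d+)\b")


def match_figure_references(lines):
    """Map each figure number to its caption line index and the line indices citing it."""
    captions = {}
    references = defaultdict(list)
    for index, line in enumerate(lines):
        text = line.strip()
        caption = CAPTION_RE.match(text)
        if caption:
            # Keep the first caption candidate seen for a given figure number.
            captions.setdefault(int(caption.group(1)), index)
            continue
        # Every remaining mention on a non-caption line counts as a reference.
        for ref in REFERENCE_RE.finditer(text):
            references[int(ref.group(1))].append(index)
    return {
        number: {
            "caption": captions.get(number),
            "references": references.get(number, []),
        }
        for number in sorted(set(captions) | set(references))
    }


if __name__ == "__main__":
    sample = [
        "Figure 1 shows the pipeline.",
        "Figure 1. System overview",
        "As shown in Fig. 2, accuracy improves.",
        "Fig. 2: Results",
    ]
    for number, match in match_figure_references(sample).items():
        print(number, match)
```

A figure number whose `caption` is `None` marks a reference with no extracted caption. A signal of that kind is one way matching results could feed back into the extraction step, although the abstract does not say how the authors implement that feedback.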