IAPR International Conference on Document Analysis and Recognition

A Deep Learning-Based Formula Detection Method for PDF Documents


Abstract

In practice, PDF files may be generated by different tools, and the quality of their character information can vary. As a result, approaches to detecting formulae in PDF documents often perform very differently across PDF files. To address this problem, we combine and refine Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) models to detect formulae according to both their character and vision features. Based on the characteristics of PDF documents, we propose a series of strategies to train and optimize deep networks, such as an implicit class down-sampling strategy that reduces the imbalance between formulae and other page elements (e.g., text paragraphs, tables, and figures). The region proposal method is also redesigned to generate a moderate number of formula candidates by combining bottom-up and top-down layout analysis. The experimental results show that combining CNN and RNN increases the robustness of the proposed detection method. Furthermore, the proposed method outperforms existing formula detection methods on both a ground-truth dataset and a larger self-built dataset, which will be released for research purposes.
