A Deep Learning-Based Formula Detection Method for PDF Documents


Abstract

In practice, PDF files may be generated by different tools, and the quality of their character information varies. As a result, approaches to detecting formulae in PDF documents often perform very differently across PDF files. To address this problem, this paper combines and refines Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) models to detect formulae using both their character and vision features. Based on the characteristics of PDF documents, we propose a series of strategies to train and optimize the deep networks, such as an implicit class down-sampling strategy that reduces the imbalance between formulae and other page elements (e.g., text paragraphs, tables, and figures). The region proposal method is also redesigned to generate a moderate number of formula candidates by combining bottom-up and top-down layout analysis. The experimental results show that combining CNN and RNN increases the robustness of the proposed detection method. Furthermore, the proposed method outperforms existing formula detection methods on both a ground-truth dataset and a larger self-built dataset, which will be released for research purposes.
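The class imbalance mentioned in the abstract can be illustrated with a generic sketch. The abstract does not say how the implicit class down-sampling strategy works, so the code below shows plain random down-sampling of the majority class, not the authors' method. The function `downsample_majority`, its `ratio` parameter, and the toy region labels are all hypothetical names chosen for illustration.

```python
import random
from collections import Counter


def downsample_majority(samples, labels, majority_label, ratio=1.0, seed=0):
    """Keep every minority sample and a random subset of the majority class.

    `ratio` is the target number of majority samples per minority sample.
    This is generic random down-sampling, assumed here for illustration only.
    """
    rng = random.Random(seed)
    pairs = list(zip(samples, labels))
    minority = [p for p in pairs if p[1] != majority_label]
    majority = [p for p in pairs if p[1] == majority_label]
    # Never ask for more majority samples than actually exist.
    keep = min(len(majority), int(round(ratio * len(minority))))
    balanced = minority + rng.sample(majority, keep)
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 3 formula regions versus 12 non-formula regions.
samples = [f"region_{i}" for i in range(15)]
labels = ["formula"] * 3 + ["other"] * 12

balanced = downsample_majority(samples, labels, majority_label="other", ratio=2.0)
counts = Counter(label for _, label in balanced)
print(counts["formula"], counts["other"])  # prints: 3 6
```

With `ratio=2.0`, the toy set shrinks from a 1:4 to a 1:2 formula-to-other ratio while keeping every formula sample.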


Bibliographic details

Source
IAPR International Conference on Document Analysis and Recognition, 2017, p. 732, 6 pages
Authors
Liangcai Gao; Xiaohan Yi; Yuan Liao; Zhuoren Jiang; Zuoyu Yan; Zhi Tang
Original format: PDF
Chinese Library Classification: TP391.41-53
Keywords
Portable document format; Feature extraction; Data mining; Task analysis; Layout; Machine learning; Streaming media


