A Symbol Dominance Based Formulae Recognition Approach for PDF Documents

Abstract

With a growing number of scientific documents available in PDF format, recognizing the formulae in these documents is of great significance. In this paper, we propose a symbol dominance based formulae recognition approach that recovers formula structure by using the rich information extracted directly from PDF files. The hierarchical structure of a formula is represented by a relationship tree, which is built recursively based on symbol dominance, a notion that considers both the spatial layout of symbols and the typesetting conventions of mathematics. In addition, we propose a special character recognition method to identify formula characters that have multiple components or variable Unicode values. Repeatable and comparable experiments were conducted on two large datasets, IM2LATEX-100K and PDFME-10K. The experimental results demonstrate that our method is more adaptive and practical for PDF documents than two existing formulae recognition systems, INFTY and WYGIWYS.
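
The abstract does not give the dominance rules themselves, so the following is only a toy sketch of the general idea of building a relationship tree recursively from symbol bounding boxes. Everything in it is an assumption made for illustration: the names `Symbol`, `Node`, `relation`, `build` and `to_latex`, the choice of the leftmost (then widest) symbol as the dominant one, and the geometric thresholds. It is not the authors' algorithm.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Symbol:
    text: str
    x0: float  # left
    y0: float  # bottom (y grows upward, as in PDF user space)
    x1: float  # right
    y1: float  # top

    @property
    def cy(self) -> float:
        return (self.y0 + self.y1) / 2


@dataclass
class Node:
    symbol: Symbol
    children: Dict[str, "Node"] = field(default_factory=dict)  # relation -> subtree


def relation(dom: Symbol, s: Symbol) -> str:
    """Classify s relative to the dominant symbol dom (hypothetical thresholds)."""
    if s.x0 >= dom.x1:            # s lies entirely to the right of dom
        if s.y0 > dom.cy:
            return "superscript"
        if s.y1 < dom.cy:
            return "subscript"
        return "next"
    if s.y0 >= dom.y1:            # horizontally overlapping and above, e.g. a numerator
        return "above"
    if s.y1 <= dom.y0:            # horizontally overlapping and below, e.g. a denominator
        return "below"
    return "next"


def build(symbols: List[Symbol]) -> Optional[Node]:
    """Pick a dominant symbol, group the rest by relation, recurse on each group."""
    if not symbols:
        return None
    # Assumed dominance rule: leftmost symbol, ties broken by the widest one.
    dom = min(symbols, key=lambda s: (s.x0, -(s.x1 - s.x0)))
    groups: Dict[str, List[Symbol]] = {}
    for s in sorted((t for t in symbols if t is not dom), key=lambda t: t.x0):
        # Once a symbol continues the baseline, everything after it belongs
        # to that symbol's subtree rather than to dom.
        rel = "next" if "next" in groups else relation(dom, s)
        groups.setdefault(rel, []).append(s)
    return Node(dom, {rel: build(group) for rel, group in groups.items()})


def to_latex(node: Optional[Node]) -> str:
    """Serialize the tree; a node with above/below children is treated as a fraction bar."""
    if node is None:
        return ""
    c = node.children
    if "above" in c or "below" in c:
        out = "\\frac{%s}{%s}" % (to_latex(c.get("above")), to_latex(c.get("below")))
    else:
        out = node.symbol.text
    if "subscript" in c:
        out += "_{%s}" % to_latex(c["subscript"])
    if "superscript" in c:
        out += "^{%s}" % to_latex(c["superscript"])
    return out + to_latex(c.get("next"))


symbols = [
    Symbol("x", 0, 0, 5, 5),
    Symbol("2", 5.5, 3, 8, 7),
    Symbol("+", 9, 1, 13, 4),
    Symbol("y", 14, -2, 18, 5),
]
print(to_latex(build(symbols)))  # x^{2}+y
```

A real system would need far richer rules (radicals, big operators, matrices, accents) and would draw on font and baseline information from the PDF rather than bounding boxes alone.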

Bibliographic record

Source
IAPR International Conference on Document Analysis and Recognition | 2017 | 732p | 6 pages in total
Authors
Xiaode Zhang; Liangcai Gao; Ke Yuan; Runtao Liu; Zhuoren Jiang; Zhi Tang
Original format: PDF
Chinese Library Classification: TP391.41-53
Keywords
Portable document format; Data mining; Optical character recognition software; Grammar; Engines; Image recognition; Compounds
