Conference: IAPR International Conference on Document Analysis and Recognition

A Symbol Dominance Based Formulae Recognition Approach for PDF Documents

Abstract

As more scientific documents become available in PDF format, recognizing the formulae in these documents is of great significance. In this paper, we propose a symbol dominance based formula recognition approach that recovers formula structures using the rich information extracted directly from PDF files. The hierarchical structure of a formula is represented by a relationship tree, which is built recursively based on symbol dominance, taking into account both the spatial layout of symbols and the typesetting conventions of mathematics. In addition, we propose a special character recognition method to identify formula characters that have multiple components or variable Unicode values. Repeatable and comparable experiments were conducted on two large datasets, IM2LATEX-100K and PDFME-10K. The experimental results demonstrate that our method is more adaptive and practical for PDF documents than two existing available formula recognition systems, INFTY and WYGIWYS.
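
The abstract does not give the dominance rules themselves, so the following is only a toy sketch of the general idea of building a relationship tree recursively from symbol bounding boxes. Everything in it is an assumption made for illustration: the names `Symbol`, `Node`, `relation`, `build_tree`, and `to_latex`, the three relations (`sup`, `sub`, `next`), the 0.25 / 0.75 height thresholds, and the convention that y grows upward. It is not the paper's algorithm and ignores fractions, radicals, and other typesetting conventions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Symbol:
    """A character with its bounding box (y grows upward; assumed convention)."""
    text: str
    x0: float
    y0: float  # bottom edge
    x1: float
    y1: float  # top edge

    @property
    def y_center(self) -> float:
        return (self.y0 + self.y1) / 2


@dataclass
class Node:
    """One node of the relationship tree; children are keyed by relation name."""
    symbol: Symbol
    children: Dict[str, "Node"] = field(default_factory=dict)


def relation(dom: Symbol, other: Symbol) -> str:
    """Classify `other` relative to the dominant symbol (hypothetical thresholds)."""
    height = dom.y1 - dom.y0
    if other.y_center > dom.y0 + 0.75 * height:
        return "sup"
    if other.y_center < dom.y0 + 0.25 * height:
        return "sub"
    return "next"


def build_tree(symbols: List[Symbol]) -> Optional[Node]:
    """Pick the leftmost symbol as dominant, group the rest, and recurse."""
    if not symbols:
        return None
    ordered = sorted(symbols, key=lambda s: s.x0)
    dom, rest = ordered[0], ordered[1:]
    groups: Dict[str, List[Symbol]] = {"sup": [], "sub": [], "next": []}
    for i, sym in enumerate(rest):
        rel = relation(dom, sym)
        if rel == "next":
            # Everything from the next baseline symbol onward is handled
            # by the recursive call, with that symbol as the new dominant one.
            groups["next"] = rest[i:]
            break
        groups[rel].append(sym)
    node = Node(dom)
    for rel, group in groups.items():
        child = build_tree(group)
        if child is not None:
            node.children[rel] = child
    return node


def to_latex(node: Optional[Node]) -> str:
    """Serialize the tree to a LaTeX-like string."""
    if node is None:
        return ""
    out = node.symbol.text
    if "sub" in node.children:
        out += "_{" + to_latex(node.children["sub"]) + "}"
    if "sup" in node.children:
        out += "^{" + to_latex(node.children["sup"]) + "}"
    return out + to_latex(node.children.get("next"))


if __name__ == "__main__":
    # Made-up boxes for the formula x^2 + y_i.
    demo = [
        Symbol("x", 0, 0, 5, 5),
        Symbol("2", 5, 4, 8, 7),
        Symbol("+", 9, 1, 13, 4),
        Symbol("y", 14, 0, 18, 5),
        Symbol("i", 18, -2, 20, 1),
    ]
    print(to_latex(build_tree(demo)))
```

The recursion mirrors the description only loosely: each call fixes one dominant symbol, attaches the symbols it governs as subtrees, and hands the remainder of the baseline to the next call.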
