Conference: IAPR International Conference on Document Analysis and Recognition

A Symbol Dominance Based Formulae Recognition Approach for PDF Documents

Abstract

As more and more scientific documents become available in PDF format, recognizing the formulae in these documents is of great significance. In this paper, we propose a symbol-dominance-based formula recognition approach that recovers formula structure using the rich information extracted directly from PDF files. The hierarchical structure of a formula is represented by a relationship tree, which is built recursively based on symbol dominance; symbol dominance considers both the spatial layout of the symbols and the typesetting conventions of mathematics. In addition, we propose a special character recognition method to identify formula characters that have multiple components or variable Unicode values. Repeatable and comparable experiments were carried out on two large datasets, IM2LATEX-100K and PDFME-10K. The experimental results demonstrate that our method is more adaptive and practical for PDF documents than two existing, available formula recognition systems, INFTY and WYGIWYS.
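
The abstract does not spell out the dominance rules, so the following is only a minimal sketch of the general idea of building a relationship tree recursively from symbol bounding boxes. It is not the authors' algorithm. The names `Symbol`, `Node`, `relation`, `build`, and `to_latex`, the three relation labels, and the 20% vertical threshold are all hypothetical choices made for illustration. Coordinates are assumed to grow downward, as in many PDF extraction tools.

```python
from dataclasses import dataclass, field


@dataclass
class Symbol:
    text: str
    x0: float  # left
    y0: float  # top (y grows downward; an assumption of this sketch)
    x1: float  # right
    y1: float  # bottom

    @property
    def cy(self) -> float:
        """Vertical center of the bounding box."""
        return (self.y0 + self.y1) / 2


@dataclass
class Node:
    symbol: Symbol
    children: dict = field(default_factory=dict)  # relation label -> Node


def relation(dom: Symbol, other: Symbol) -> str:
    """Hypothetical rule: classify `other` by its vertical center
    relative to the dominant symbol's box (20% margin is arbitrary)."""
    margin = 0.2 * (dom.y1 - dom.y0)
    if other.cy < dom.y0 + margin:
        return "sup"
    if other.cy > dom.y1 - margin:
        return "sub"
    return "next"


def build(symbols):
    """Recursively build a relationship tree. The leftmost symbol is
    treated as dominant; symbols before the next baseline symbol become
    its superscript or subscript, and the rest continue the baseline."""
    if not symbols:
        return None
    ordered = sorted(symbols, key=lambda s: s.x0)
    head, rest = ordered[0], ordered[1:]
    groups = {"sup": [], "sub": [], "next": []}
    for i, sym in enumerate(rest):
        rel = relation(head, sym)
        if rel == "next":
            # Everything from here on belongs to the remaining baseline.
            groups["next"] = rest[i:]
            break
        groups[rel].append(sym)
    node = Node(head)
    for rel, members in groups.items():
        child = build(members)
        if child is not None:
            node.children[rel] = child
    return node


def to_latex(node) -> str:
    """Serialize the tree to a LaTeX-like string."""
    if node is None:
        return ""
    out = node.symbol.text
    if "sup" in node.children:
        out += "^{" + to_latex(node.children["sup"]) + "}"
    if "sub" in node.children:
        out += "_{" + to_latex(node.children["sub"]) + "}"
    if "next" in node.children:
        out += to_latex(node.children["next"])
    return out


# Made-up bounding boxes for the formula x^2 + y_i.
demo = [
    Symbol("x", 0, 10, 8, 20),
    Symbol("2", 8, 6, 12, 12),
    Symbol("+", 14, 11, 22, 19),
    Symbol("y", 24, 10, 32, 20),
    Symbol("i", 32, 18, 35, 24),
]
print(to_latex(build(demo)))  # x^{2}+y_{i}
```

A real system would need many more relations (fractions, radicals, accents, matrices) and rules informed by typesetting conventions, which is what the abstract attributes to symbol dominance; this sketch covers only left-to-right baselines with superscripts and subscripts.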
