Conference: IAPR International Workshop on Document Analysis Systems

Automatic Recovery of Corrupted Font Encoding in PDF Documents Using CNN-based Symbol Recognition with Language Model

Abstract

In this paper, we present a novel solution to the problem of extracting text from PDF documents whose embedded fonts are incorrectly encoded. Instead of running full-scale optical character recognition (OCR) on such documents, our system detects the problematic fonts and restores their encodings rather than recognizing the text itself. It applies a convolutional neural network (CNN) sparingly, recognizing individual glyphs and classifying them into homoglyph groups. Font glyph-to-character mappings are then recovered by combining glyph location information with a language model. We compare our approach against a full-scale OCR baseline on a new dataset of multilingual PDF files that we have made available to the research community. Our evaluation on a variety of test cases shows that our approach achieves a consistently lower error rate and significantly faster processing time.
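
The final recovery step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the CNN stage has already produced, for each glyph of a corrupted font, a homoglyph group of candidate characters, and it then picks the glyph-to-character mapping that a toy character-bigram language model scores highest. The function names, the bigram model, the exhaustive search, and the sample corpus are all assumptions made for illustration, and the glyph location information mentioned in the abstract is ignored here.

```python
import math
from collections import Counter
from itertools import product


def train_bigram_counts(corpus):
    """Count character bigrams in a reference corpus (toy language model)."""
    counts = Counter()
    for a, b in zip(corpus, corpus[1:]):
        counts[a + b] += 1
    return counts


def score(text, counts):
    """Unnormalized add-one log score of a string under the bigram counts."""
    return sum(math.log(counts[a + b] + 1) for a, b in zip(text, text[1:]))


def recover_mapping(glyph_ids, candidates, counts):
    """Pick one character per glyph so the decoded text scores highest.

    glyph_ids  -- glyph identifiers in reading order (hypothetical input)
    candidates -- dict: glyph id -> string of candidate characters,
                  standing in for a homoglyph group from the classifier
    """
    ids = sorted(candidates)
    best_map, best_score = None, -math.inf
    # Exhaustive search is only viable for a handful of ambiguous glyphs;
    # a real system would need a smarter search.
    for choice in product(*(candidates[g] for g in ids)):
        mapping = dict(zip(ids, choice))
        text = "".join(mapping[g] for g in glyph_ids)
        s = score(text, counts)
        if s > best_score:
            best_map, best_score = mapping, s
    return best_map


# Toy usage: glyph 2 looks like "o" or "0", glyph 3 like "l", "I", or "1".
counts = train_bigram_counts("the cool old cat looked at the moon.")
mapping = recover_mapping([1, 2, 2, 3], {1: "c", 2: "o0", 3: "lI1"}, counts)
print(mapping)
```

Because the mapping is chosen once per glyph rather than once per occurrence, every appearance of a glyph in the document contributes evidence to the same decision, which is the intuition behind restoring the font instead of recognizing the text.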