Automatic Recovery of Corrupted Font Encoding in PDF Documents Using CNN-based Symbol Recognition with Language Model

Abstract

In this paper, we present a novel solution to the problem of extracting text from PDF documents whose embedded fonts are incorrectly encoded. Instead of running full-scale optical character recognition (OCR) on such documents, our system detects the problematic fonts and restores them rather than the text itself. It uses a convolutional neural network (CNN) sparingly, applying OCR only to individual glyphs and classifying them into homoglyph groups. Font glyph-to-character mappings are recovered by combining glyph location information with a language model. We compare our approach against a full-scale OCR baseline on a new dataset of multilingual PDF files that we have made available to the research community. Our evaluation on a variety of test cases shows that our approach has a consistently lower error rate and a significantly faster processing time.
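
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of resolving homoglyph ambiguity with a language model, not the authors' method. It assumes a glyph classifier has already reduced each glyph ID to a small candidate set (a homoglyph group such as "l", "I", "1"), and it swaps in a toy character-bigram model with add-one smoothing and a brute-force search. All names (`train_bigram_model`, `recover_mapping`, the glyph IDs, the corpus) are hypothetical, and the sketch ignores glyph location information entirely.

```python
import itertools
import math
from collections import Counter


def train_bigram_model(corpus):
    """Return a function giving smoothed log-probabilities of character bigrams."""
    pairs = Counter(zip(corpus, corpus[1:]))
    singles = Counter(corpus)
    vocab = len(set(corpus)) + 1  # +1 reserves mass for unseen characters

    def log_prob(prev, cur):
        # Add-one smoothing: unseen bigrams get a small, nonzero probability.
        return math.log((pairs[(prev, cur)] + 1) / (singles[prev] + vocab))

    return log_prob


def score(text, log_prob):
    """Sum of bigram log-probabilities over a decoded string."""
    return sum(log_prob(a, b) for a, b in zip(text, text[1:]))


def recover_mapping(glyph_runs, candidates, log_prob):
    """Choose one character per glyph ID so the decoded runs score highest.

    Brute force over all combinations; only viable for a handful of
    ambiguous glyphs (an assumption of this sketch).
    """
    glyph_ids = sorted(candidates)
    best_mapping, best_score = None, -math.inf
    for choice in itertools.product(*(candidates[g] for g in glyph_ids)):
        mapping = dict(zip(glyph_ids, choice))
        total = sum(
            score("".join(mapping[g] for g in run), log_prob)
            for run in glyph_runs
        )
        if total > best_score:
            best_mapping, best_score = mapping, total
    return best_mapping


# Hypothetical input: glyph 3 is ambiguous between "l", "I" and "1".
log_prob = train_bigram_model("hello well tell all hello well")
candidates = {1: ["h"], 2: ["e"], 3: ["l", "I", "1"], 4: ["o"], 5: ["w"]}
glyph_runs = [[1, 2, 3, 3, 4], [5, 2, 3, 3]]

mapping = recover_mapping(glyph_runs, candidates, log_prob)
decoded = "".join(mapping[g] for g in glyph_runs[0])
print(decoded)
```

Deciding once per glyph ID, rather than once per occurrence, mirrors the abstract's emphasis on restoring the font rather than the text: every occurrence of a glyph contributes evidence to a single mapping entry.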

Bibliographic record

Source: IAPR International Workshop on Document Analysis Systems, 2018, pp. 121-126 (6 pages)
Authors: Mark Vol; Andrey Krutko; Nicolas Stefanovitch; Denis Postanogov
Original format: PDF
Keywords: Portable document format; Encoding; Optical character recognition software; Google; Visualization; Task analysis; Adaptive optics
