Exchanging image processing and OCR components in a Setswana digitisation pipeline

Gideon Jozua Kotzé; Friedel Wolff

Abstract

As more natural language processing (NLP) applications benefit from neural-network-based approaches, it makes sense to re-evaluate existing work in NLP. A complete digitisation pipeline includes several components that handle the material in sequence. Image processing after scanning has been shown to be an important factor in final quality. Here we compare two approaches for visually enhancing documents before optical character recognition (OCR): (1) a combination of ImageMagick and Unpaper and (2) OCRopus. We also compare Calamari, a new line-based OCR package using neural networks, with the well-known Tesseract 3 as the OCR component. Our evaluation on a set of Setswana documents reveals that the combination of ImageMagick/Unpaper and Calamari improves on a current baseline based on Tesseract 3 and ImageMagick/Unpaper by over 30%, achieving a mean character error rate of 1.69 across all combined test data.

Author biographies: Senior Researcher, Academy of African Languages and Science, College of Graduate Studies.
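The character error rate (CER) used as the evaluation metric in the abstract can be illustrated with a short sketch. The abstract does not give the exact definition or any text normalisation, so the version below is an assumption: the Levenshtein edit distance between the OCR output and the reference transcription, divided by the reference length and expressed as a percentage. The function names are hypothetical and are not taken from the paper or from any of the tools it compares.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER as a percentage of the reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)
```

Under this assumed definition, a mean CER would be obtained by averaging `character_error_rate` over the test documents; whether the paper averages per line, per page or over the pooled characters is not stated in the abstract.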

