South African Computer Journal

Exchanging image processing and OCR components in a Setswana digitisation pipeline


Abstract

As more natural language processing (NLP) applications benefit from neural-network-based approaches, it makes sense to re-evaluate existing work in NLP. A complete digitisation pipeline consists of several components that handle the material in sequence. Image processing after scanning has been shown to be an important factor in the final quality. Here we compare two approaches to visually enhancing documents before Optical Character Recognition (OCR): (1) a combination of ImageMagick and Unpaper, and (2) OCRopus. As the OCR component, we also compare Calamari, a new line-based OCR package using neural networks, with the well-known Tesseract 3. Our evaluation on a set of Setswana documents reveals that the combination of ImageMagick/Unpaper and Calamari improves on a current baseline based on Tesseract 3 and ImageMagick/Unpaper by over 30%, achieving a mean character error rate of 1.69 across all combined test data.

Author Biographies: Senior Researcher, Academy of African Languages and Science, College of Graduate Studies.
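The abstract reports a mean character error rate (CER) and a relative improvement over a baseline, but it does not spell out how either figure is computed. The sketch below shows one common way to obtain both, assuming that CER is the character-level edit distance divided by the length of the reference text and expressed as a percentage, and that the mean is taken over per-document scores. The function names and this exact definition are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, Tuple


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `ref` into `hyp` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent (assumed definition, see lead-in)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


def mean_cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Unweighted mean of per-document CER over (reference, OCR output) pairs."""
    scores = [cer(ref, hyp) for ref, hyp in pairs]
    if not scores:
        raise ValueError("at least one document pair is required")
    return sum(scores) / len(scores)


def relative_reduction(baseline: float, new: float) -> float:
    """Relative error reduction of `new` over `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline
```

With these helpers, an OCR output that differs from an 11-character reference in one character scores a CER of about 9.09, and a system that halves the baseline error shows a relative reduction of 50. Averaging per-document scores is only one option; pooling all edits and dividing by the total reference length weights long documents more heavily and may give a different figure.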