Conference: IAPR International Workshop on Document Analysis Systems

Improving Book OCR by Adaptive Language and Image Models

Abstract

In order to cope with the vast diversity of book content and typefaces, it is important for OCR systems to leverage the strong consistency within a book but adapt to variations across books. We describe a system that combines two parallel correction paths using document-specific image and language models. Each model adapts to shapes and vocabularies within a book to identify inconsistencies as correction hypotheses, but relies on the other for effective cross-validation. Using the open source Tesseract engine as baseline, results on a large data set of scanned books demonstrate that word error rates can be reduced by 25 percent using this approach.
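
The cross-validation idea in the abstract can be illustrated with a toy sketch. This is not the authors' system and does not use Tesseract. It assumes a book-specific vocabulary built from repeated words (standing in for the adaptive language model) and a hand-written set of look-alike glyph pairs, `CONFUSABLE` (a hypothetical stand-in for the adaptive image model). Only one direction is shown: the language side proposes a correction, and the image side must agree before it is accepted. All names here are illustrative.

```python
from collections import Counter

# Hypothetical stand-in for a book-adapted image model: glyph pairs assumed
# to look alike in this book's typeface.
CONFUSABLE = {frozenset("ce"), frozenset("l1"), frozenset("O0")}


def book_vocabulary(words, min_count=2):
    """Words seen at least min_count times are treated as the book's vocabulary."""
    counts = Counter(words)
    return {w for w, c in counts.items() if c >= min_count}


def single_substitution(a, b):
    """Return the differing (a_char, b_char) pair if a and b have equal
    length and differ in exactly one position; otherwise return None."""
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def correct(words):
    """Replace an out-of-vocabulary word with a vocabulary word only when
    the single differing glyph pair is one the image side calls confusable."""
    vocab = book_vocabulary(words)
    out = []
    for w in words:
        fixed = w
        if w not in vocab:  # language side flags an inconsistency
            for cand in sorted(vocab):
                pair = single_substitution(w, cand)
                # image side cross-validates the proposed correction
                if pair and frozenset(pair) in CONFUSABLE:
                    fixed = cand
                    break
        out.append(fixed)
    return out


ocr_words = ["the", "whale", "the", "whalc", "whale", "sea", "thc"]
print(correct(ocr_words))
# ['the', 'whale', 'the', 'whale', 'whale', 'sea', 'the']
```

In this sketch "sea" is left alone even though it is rare, because no vocabulary word is a single confusable glyph away from it. That is the point of requiring agreement from the second model: a rare word is not automatically treated as an error.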
