The same is not the same - postcorrection of alphabet confusion errors in mixed-alphabet OCR recognition

Abstract

Character sets for Eastern European languages typically contain symbols that are optically almost or fully identical to Latin letters. When scanning documents with mixed Cyrillic-Latin or Greek-Latin alphabets, even high-quality OCR software often cannot correctly distinguish Cyrillic (Greek) from Latin symbols. This leads to an error rate far above the rates usually observed when recognizing single-alphabet documents. In this paper we first survey similarities between Latin and Cyrillic (Greek) letters and words for distinct languages and fonts. After briefly introducing a new, public corpus collected by our groups for evaluating OCR technology on mixed-alphabet documents, we describe how to adapt general algorithms and tools for postcorrection of OCR results to the new context of mixed-alphabet recognition. Experimental results on Bulgarian documents from the corpus and from other sources demonstrate that a drastic reduction of error rates can be achieved.
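
To make the confusion problem concrete, here is a minimal Python sketch of lexicon-based postcorrection for tokens that mix Latin and Cyrillic look-alike letters. It is not the method described in the paper. The homoglyph table is a small illustrative subset, and the names `correct_token`, `scripts_in`, `cyrillic_lexicon`, and `latin_lexicon` are hypothetical, as are the two tiny lexicons.

```python
import unicodedata

# Latin letters and their visually near-identical Cyrillic counterparts.
# Assumption: an illustrative subset, not the paper's confusion table.
LATIN_TO_CYRILLIC = {
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041D", "K": "\u041A", "M": "\u041C", "O": "\u041E",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043E",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
}
TO_CYRILLIC = str.maketrans(LATIN_TO_CYRILLIC)
TO_LATIN = str.maketrans({cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()})


def scripts_in(token):
    """Return the set of scripts ('Latin', 'Cyrillic') used by letters in token."""
    found = set()
    for ch in token:
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            found.add("Latin")
        elif name.startswith("CYRILLIC"):
            found.add("Cyrillic")
    return found


def correct_token(token, cyrillic_lexicon, latin_lexicon):
    """Rewrite token into a single alphabet if exactly one lexicon accepts it."""
    if token in cyrillic_lexicon or token in latin_lexicon:
        return token  # already a known word, leave it alone
    as_cyrillic = token.translate(TO_CYRILLIC)
    as_latin = token.translate(TO_LATIN)
    in_cyrillic = as_cyrillic in cyrillic_lexicon
    in_latin = as_latin in latin_lexicon
    if in_cyrillic and not in_latin:
        return as_cyrillic
    if in_latin and not in_cyrillic:
        return as_latin
    return token  # ambiguous or unknown: do not guess


# Hypothetical lexicons: the Bulgarian word for "Sofia" (lowercase) and "cop".
cyrillic_lexicon = {"\u0441\u043E\u0444\u0438\u044F"}
latin_lexicon = {"cop"}

# Simulated OCR output: Latin "c" and "o" followed by three Cyrillic letters.
ocr_token = "co" + "\u0444\u0438\u044F"
fixed = correct_token(ocr_token, cyrillic_lexicon, latin_lexicon)
```

The sketch declines to change a token when both lexicons accept it or neither does, because a word spelled entirely with look-alike letters cannot be resolved from its spelling alone. A practical system would need context or a statistical model for those cases.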