Conference: International Conference on Computational Processing of Portuguese

The Other C: Correcting OCR Words in the Presence of Diacritical Marks


Abstract

We propose a lexicon-based method for correcting a word recognized by an OCR engine (a classifier). This postprocessing method was originally designed for language models that support diacritical marks, such as Portuguese. Because the classifier can confuse these special marks with noise, keeping only the top hypothesis per glyph of the original image can lead to wrong predictions. To cope with this, our method uses a filtering strategy to select the best hypotheses for each glyph, which are then used to produce candidate queries. A best query is selected according to confidence rate and edit distance to the word. A similarity search over the best query then suggests a correction. Experiments show that the method considerably improves prediction accuracy for Portuguese word correction.
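The pipeline described in the abstract (filter per-glyph hypotheses, build candidate queries, pick a best query, run a similarity search over the lexicon) might be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify the filtering rule, the scoring function, or the search structure, so the `top_k` and `min_conf` thresholds, the mean-confidence score, the reading of "edit distance to the word" as distance to the nearest lexicon entry, and the brute-force nearest-word search are all assumptions, as are the function names and the toy data.

```python
from itertools import product


def edit_distance(a, b):
    """Levenshtein distance, computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def filter_hypotheses(glyph_hyps, top_k=2, min_conf=0.10):
    """Keep up to top_k hypotheses per glyph with confidence >= min_conf.

    Both thresholds are assumptions; the top hypothesis is always kept.
    """
    kept = []
    for hyps in glyph_hyps:
        ranked = sorted(hyps, key=lambda h: h[1], reverse=True)
        selected = [h for h in ranked[:top_k] if h[1] >= min_conf]
        kept.append(selected or ranked[:1])
    return kept


def candidate_queries(filtered):
    """Yield (text, mean confidence) for every combination of kept hypotheses."""
    for combo in product(*filtered):
        text = "".join(ch for ch, _ in combo)
        conf = sum(c for _, c in combo) / len(combo)
        yield text, conf


def correct(glyph_hyps, lexicon):
    """Return the lexicon word suggested for the best candidate query."""
    filtered = filter_hypotheses(glyph_hyps)

    # Assumed scoring: smallest distance to any lexicon word first,
    # then higher mean confidence as the tie-breaker.
    def score(query):
        text, conf = query
        return (min(edit_distance(text, w) for w in lexicon), -conf)

    best_text, _ = min(candidate_queries(filtered), key=score)
    # Brute-force similarity search: nearest lexicon word by edit distance.
    return min(lexicon, key=lambda w: edit_distance(best_text, w))


# Toy input: per-glyph (character, confidence) hypotheses for one word image.
# The cedilla and tilde are ranked second, as if they had been read as noise.
hyps = [
    [("c", 0.95)], [("o", 0.90)], [("r", 0.92)], [("a", 0.90)],
    [("c", 0.60), ("ç", 0.40)],
    [("a", 0.55), ("ã", 0.45)],
    [("o", 0.93)],
]
lexicon = ["coração", "corda", "cação"]
print(correct(hyps, lexicon))
```

In this toy input, keeping only the top hypothesis per glyph would yield "coracao"; retaining the second hypothesis for the two ambiguous glyphs lets the candidate "coração" match the lexicon exactly. Enumerating every combination grows exponentially with the number of ambiguous glyphs, which is one reason a filtering step would matter in practice.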
