Conference: Document Recognition and Retrieval XX

WFST-based Ground Truth Alignment for Difficult Historical Documents with Text Modification and Layout Variations

Abstract

This work proposes several approaches for generating correspondences between real scanned books and their transcriptions, which may differ from the scans through text modifications and layout variations, while also taking OCR errors into account. Our approaches to aligning the manuscript with the transcription are based on weighted finite-state transducers (WFSTs). In particular, we propose adapted WFSTs that represent the transcription to be aligned with the OCR lattices. The character-level alignment includes edit rules that allow the edit operations insertion, deletion, and substitution. These edit operations let the transcription model deal with OCR segmentation and recognition errors, and also with alignment against different text editions. We implemented an alignment model with a hyphenation model so that it can adapt to a non-hyphenated transcription. Our models also handle the ligatures typically found in historical Fraktur documents. We evaluated our approach on Fraktur documents from the "Wanderungen durch die Mark Brandenburg" volumes (1862-1889) and observed the performance of the models under OCR errors. We compare the performance of our model across scenarios in which no correspondence information is available at the word (i), line (ii), sentence (iii), or page (iv) level.
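To make the role of the three edit operations concrete, here is a minimal sketch of character-level alignment between one OCR string and its transcription. It is not the WFST method described in the abstract: it uses a plain dynamic-programming edit distance on a single best OCR hypothesis rather than an OCR lattice, and the function name `align`, the unit costs, and the operation labels are assumptions chosen for illustration only.

```python
def align(ocr, truth, sub_cost=1, ins_cost=1, del_cost=1):
    """Align an OCR string with a ground-truth string.

    Returns (total_cost, ops), where each op is a tuple
    (label, ocr_char, truth_char) and label is one of
    "match", "substitute", "delete" (OCR character with no
    counterpart) or "insert" (ground-truth character missing in OCR).
    """
    n, m = len(ocr), len(truth)
    # cost[i][j] = minimal cost of aligning ocr[:i] with truth[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * del_cost
    for j in range(1, m + 1):
        cost[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if ocr[i - 1] == truth[j - 1] else sub_cost)
            cost[i][j] = min(diag, cost[i - 1][j] + del_cost, cost[i][j - 1] + ins_cost)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = ocr[i - 1] == truth[j - 1]
            if cost[i][j] == cost[i - 1][j - 1] + (0 if same else sub_cost):
                ops.append(("match" if same else "substitute", ocr[i - 1], truth[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + del_cost:
            ops.append(("delete", ocr[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", truth[j - 1]))
            j -= 1
    ops.reverse()
    return cost[n][m], ops


if __name__ == "__main__":
    total, ops = align("Marf", "Mark")
    print(total)
    for op in ops:
        print(op)
```

In a WFST formulation, the same three operations would typically appear as weighted arcs, and the best alignment would be found as a lowest-cost path through the composed machines. The sketch only illustrates what each edit operation contributes to a character-level correspondence.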