WFST-based Ground Truth Alignment for Difficult Historical Documents with Text Modification and Layout Variations


Abstract

This work proposes several approaches for generating correspondences between real scanned books and their transcriptions, which may differ through text modifications and layout variations, while also taking OCR errors into account. Our approaches for aligning the manuscript with the transcription are based on weighted finite state transducers (WFSTs). In particular, we propose adapted WFSTs that represent the transcription to be aligned with the OCR lattices. The character-level alignment includes edit rules that allow edit operations (insertion, deletion, substitution). These edit operations let the transcription model deal with OCR segmentation and recognition errors, and also with the task of aligning different text editions. We implemented an alignment model with a hyphenation model, so that it can adapt to a non-hyphenated transcription. Our models also handle Fraktur ligatures, which are typically found in historical Fraktur documents. We evaluated our approach on Fraktur documents from the "Wanderungen durch die Mark Brandenburg" volumes (1862-1889) and observed the performance of those models under OCR errors. We compare the performance of our model for four different scenarios: having no information about the correspondence at the word (i), line (ii), sentence (iii), or page (iv) level.
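The character-level edit operations mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' WFST implementation: it replaces transducer composition over OCR lattices with a plain dynamic-programming (Levenshtein) alignment between a single OCR string and the transcription. The function name `align`, the unit costs, and the operation labels are assumptions made for illustration only.

```python
from typing import List, Tuple

Op = Tuple[str, str, str]  # (operation, ocr_char, transcription_char)


def align(ocr: str, truth: str,
          sub_cost: int = 1, ins_cost: int = 1, del_cost: int = 1
          ) -> Tuple[int, List[Op]]:
    """Align an OCR string with a transcription using edit operations.

    Hypothetical helper, not taken from the paper. "delete" drops an OCR
    character that has no counterpart in the transcription; "insert" adds
    a transcription character that the OCR output is missing.
    """
    n, m = len(ocr), len(truth)
    # cost[i][j] = cheapest way to align ocr[:i] with truth[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * del_cost
    for j in range(1, m + 1):
        cost[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if ocr[i - 1] == truth[j - 1] else sub_cost)
            cost[i][j] = min(diag, cost[i - 1][j] + del_cost, cost[i][j - 1] + ins_cost)

    # Trace back from the final cell to recover one cheapest edit sequence.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ocr[i - 1] == truth[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (0 if same else sub_cost):
            ops.append(("match" if same else "substitute", ocr[i - 1], truth[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + del_cost:
            ops.append(("delete", ocr[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", truth[j - 1]))
            j -= 1
    ops.reverse()
    return cost[n][m], ops


if __name__ == "__main__":
    # A recognition error ("f" for "k") and a line-break hyphen in the OCR text.
    for ocr_text, transcription in [("Marf", "Mark"), ("Wan-derung", "Wanderung")]:
        total, edits = align(ocr_text, transcription)
        print(total, [e for e in edits if e[0] != "match"])
```

In this simplification a line-break hyphen in the OCR text is absorbed as a single deletion. A WFST formulation would instead encode such edits as weighted arcs and could score many OCR hypotheses at once.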

Bibliographic details

Source
Document Recognition and Retrieval XX | 2013 | 865818.1-865818.12 | 12 pages
Conference location: Burlingame, CA (US)
Authors
Mayce Al Azawi; Marcus Liwicki; Thomas M Breuel
Author affiliations

Department of Computer Science, Technical University of Kaiserslautern, D-67663 Kaiserslautern, Germany;

German Research Center for Artificial Intelligence (DFKI), D-67663 Kaiserslautern, Germany;

Department of Computer Science, Technical University of Kaiserslautern, D-67663 Kaiserslautern, Germany;

Original format: PDF
Language: English
Keywords
Alignment; Imperfect Transcription; Historical Documents; WFST; OCR; Hyphenation


