Conference: International Conference on Frontiers in Handwriting Recognition

Two Semi-Supervised Training Approaches for Automated Text Recognition

Abstract

Automated text recognition is a fundamental problem in Document Image Analysis. Optical models are used to model characters, while language models are used to compose sentences. Because scripts and linguistic contexts differ widely, the models must be specialized by training on task-dependent ground truth. However, creating a sufficient amount of ground truth, at least for historical handwritten scripts, requires well-qualified persons to mark and transcribe text lines, which is very time-consuming. On the other hand, in many cases transcripts that are not yet assigned to text lines are already available at page level from another process chain, or at least transcripts from a similar linguistic context are available. In this work we present two approaches that make use of such transcripts: the first creates training data by automatically assigning page-dependent transcripts to text lines, whereas the second uses a task-specific language model to generate highly confident training data. Both approaches are successfully applied to a very challenging historical handwritten collection.
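The abstract does not say how either approach is implemented, so the following is only a minimal sketch of the two general ideas, not the authors' method. It assumes a recognizer has already produced a rough hypothesis for each text line. The function names (`assign_transcripts`, `select_confident`), the use of `difflib` string similarity, the greedy matching, and the thresholds are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], using difflib's ratio."""
    return SequenceMatcher(None, a, b).ratio()


def assign_transcripts(hypotheses, transcript_lines, min_score=0.6):
    """Greedily pair recognizer hypotheses with page-level transcript lines.

    Hypothetical helper: each hypothesis gets at most one transcript line,
    best-scoring pairs are taken first, and pairs scoring below `min_score`
    are dropped so that doubtful assignments do not become training data.
    Returns {hypothesis_index: (transcript_line, score)}.
    """
    candidates = sorted(
        (
            (similarity(hyp, ref), i, j)
            for i, hyp in enumerate(hypotheses)
            for j, ref in enumerate(transcript_lines)
        ),
        reverse=True,
    )
    used_hyp, used_ref, pairs = set(), set(), {}
    for score, i, j in candidates:
        if score < min_score:
            break  # candidates are sorted, so all remaining pairs are worse
        if i in used_hyp or j in used_ref:
            continue
        used_hyp.add(i)
        used_ref.add(j)
        pairs[i] = (transcript_lines[j], score)
    return pairs


def select_confident(decoded, threshold=0.9):
    """Keep (line_id, text) for decoded lines with confidence >= threshold.

    `decoded` is assumed to hold (line_id, text, confidence) triples, for
    example from decoding with a task-specific language model.
    """
    return [(line_id, text) for line_id, text, conf in decoded if conf >= threshold]
```

In this sketch, a hypothesis that matches no transcript line well is simply left out of the training set, and the thresholds control the trade-off between the amount of training data and its reliability.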
