Tools for Semi-automatic Preparation of Training Data for OCR


Abstract

This work addresses data preparation for OCR systems based on recurrent neural networks. Precisely annotated data are necessary both for training a network and for evaluating OCR methods. The data can be synthesized; however, synthetic data are not as realistic as real data. Manual annotation is therefore still needed in many cases, especially for the historical documents we focus on. Although several complex systems for historical document processing exist, to the best of our knowledge a simple annotation tool for OCR data is missing. We therefore propose and implement a set of tools that use artificial intelligence to simplify the annotation process. These tools create ground truths for line images, which are used to train current OCR systems. Another contribution of this paper is that the tools are made freely available for research purposes.

Bibliographic record

Source
IFIP WG 12.5 International workshops on artificial intelligence applications and innovations; Mining humanistic data workshop; Workshop on 5g-putting intelligence to the network edge; Workshop on emerging trends in AI. 2019, pp. 351-361 (11 pages)
Conference location: Hersonissos (GR)
Authors
Ladislav Lenc; Jiří Martínek; Pavel Král;
Author affiliations

Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic; NITS - New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic

Original format: PDF
Keywords
CNN; Historical documents; LSTM; Neural networks; OCR;

Indexed: 2022-08-26 14:42:02

Similar literature

1. Developing a semi-automatic data conversion tool for Korean ecological data standardization [J]. Hyeonjeong Lee, Hoseok Jung, Miyoung Shin. Journal of Ecology and Environment. 2017, Issue 1.

2. Semi-Automatic Labeling of Training Data Sets in Text Classification [J]. Nayereh Ghahreman, Ahmad Baraani Dastjerdi. Computer and Information Science. 2011, Issue 6.

3. Tools for Semi-automatic Preparation of Training Data for OCR [C]. Ladislav Lenc, Jiří Martínek, Pavel Král. IFIP WG 12.5 International workshops on artificial intelligence applications and innovations. 2019.

4. Optimal algorithms for L1-norm Principal Component Analysis: New tools for signal processing and machine learning with few and/or faulty training data [D]. Markopoulos, Panagiotis. 2015.

5. Machine-learning based segmentation of the optic nerve head using multi-contrast Jones matrix optical coherence tomography with semi-automatic training dataset generation [O]. Deepa Kasaragod, Shuichi Makita, Young-Joo Hong. 2018.

6. Improving chunker performance using a web-based semi-automatic training data analysis tool [O]. Endrédy István. 2015.