DocEmul: a Toolkit to Generate Structured Historical Documents


Abstract

We propose a toolkit that generates structured synthetic documents by emulating the actual document production process. Synthetic documents can be used to train systems for document analysis tasks. In our case, we address the record counting task on handwritten structured collections that contain a limited number of examples. Using the DocEmul toolkit, we can generate a larger dataset to train a deep architecture that predicts the number of records on each page. The toolkit can generate synthetic collections and also perform data augmentation to create a larger training dataset. It includes a method to extract the page background from real pages; this background serves as a substrate on which records are written according to variable structures and using cursive fonts. Moreover, the synthetic collection can be extended by adding random noise, page rotations, and other visual variations. We performed experiments on two different handwritten collections, using the toolkit to generate synthetic data to train a Convolutional Neural Network that counts the number of records in the real collections.
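
The augmentation step mentioned in the abstract (random noise and page rotations) can be illustrated with a minimal NumPy sketch. This is not DocEmul's code or API: the function names `add_gaussian_noise`, `rotate_page`, and `augment`, the parameter ranges, and the choice of Gaussian noise with nearest-neighbour rotation are all assumptions made for illustration. The sketch also assumes grayscale pages stored as `uint8` arrays, and that small perturbations leave the record count of a page unchanged, so each variant can reuse the label of its source page.

```python
import numpy as np


def add_gaussian_noise(page, sigma, rng):
    """Add zero-mean Gaussian noise to a grayscale page with values in [0, 255]."""
    noisy = page.astype(np.float64) + rng.normal(0.0, sigma, size=page.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def rotate_page(page, angle_deg, fill=255):
    """Rotate a grayscale page about its centre with nearest-neighbour sampling."""
    h, w = page.shape
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.indices((h, w))
    dx, dy = xs - cx, ys - cy
    # Inverse mapping: for every output pixel, look up the source pixel.
    src_x = np.rint(np.cos(theta) * dx + np.sin(theta) * dy + cx).astype(int)
    src_y = np.rint(-np.sin(theta) * dx + np.cos(theta) * dy + cy).astype(int)
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.full_like(page, fill)  # pixels that map outside the page become background
    out[inside] = page[src_y[inside], src_x[inside]]
    return out


def augment(page, n_variants, max_angle=2.0, max_sigma=10.0, seed=0):
    """Create n_variants perturbed copies of one page (hypothetical parameter ranges)."""
    rng = np.random.default_rng(seed)
    variants = []
    for _ in range(n_variants):
        angle = rng.uniform(-max_angle, max_angle)
        sigma = rng.uniform(0.0, max_sigma)
        variants.append(add_gaussian_noise(rotate_page(page, angle), sigma, rng))
    return variants


# Usage: a blank synthetic "page" with one dark stroke standing in for a record.
page = np.full((64, 48), 255, dtype=np.uint8)
page[10:12, 5:40] = 0
variants = augment(page, n_variants=3)
```

Fixing the seed makes the generated variants reproducible, which helps when the same synthetic set must be regenerated for later training runs.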


Bibliographic details

Source
IAPR International Conference on Document Analysis and Recognition | 2017 | pp. 733-1472 | 6 pages
Authors
Samuele Capobianco; Simone Marinai;
Original format: PDF
Chinese Library Classification: TP391.41-53
Keywords
Synthetic Document Generation; Historical Documents; Data Augmentation; Deep Learning;


Similar literature


1. The use of Gabor features for semi-automatically generated polygon-based ground truth of historical document images [J] . Wei Hao, Seuret Mathias, Liwicki Marcus. Literary & linguistic computing . 2017, issue aprasuppla1

2. FRACTURE mining: Mining frequently and concurrently mutating structures from historical XML documents [J] . Ling Chen, Sourav S. Bhowmick, Liang-Tien Chia. Data & Knowledge Engineering . 2006, issue 2

3. Dynamically generating T32 training documents using structured data [J] . Paul James Albert, Ayesha Joshi. Journal of the Medical Library Association . 2019, issue 3

4. DocEmul: A Toolkit to Generate Structured Historical Documents [C] . Samuele Capobianco, Simone Marinai. IAPR International Conference on Document Analysis and Recognition . 2017

5. Generating An Overview Report of Multilevel Structure over A Large Corpus of Documents [D] . Wang, Jingwen. 2019

6. Dynamically generating T32 training documents using structured data [O] . Paul James Albert, Ayesha Joshi 2019

7. DocEmul: a Toolkit to Generate Structured Historical Documents [O] . Capobianco, Samuele, Marinai, Simone 2017
