DocEmul: A Toolkit to Generate Structured Historical Documents

Abstract

We propose a toolkit that generates structured synthetic documents by emulating the actual document production process. Synthetic documents can be used to train systems to perform document analysis tasks. Here we address the record counting task on handwritten structured collections for which only a limited number of examples is available. With the DocEmul toolkit we can generate a larger dataset to train a deep architecture that predicts the number of records on each page. The toolkit can generate synthetic collections and also perform data augmentation to enlarge the training set. It includes a method to extract the page background from real pages; the background serves as a substrate on which records are written according to variable structures, using cursive fonts. The synthetic collection can be further extended by adding random noise, page rotations, and other visual variations. We ran experiments on two different handwritten collections, using the toolkit to generate synthetic data for training a Convolutional Neural Network that counts the records in the real collections.
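The generate-then-augment idea in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the DocEmul API: the function names (`make_page`, `add_noise`, `rotate`, `augment`) and all parameters are hypothetical, plain horizontal bars stand in for records written with cursive fonts, and a blank white page stands in for an extracted background. The point is only that every augmented variant keeps the record count of its source page as the training label.

```python
import math
import random


def make_page(height, width, n_records, rng, background=255, ink=0):
    """Draw n_records horizontal bars (stand-ins for records) on a blank page."""
    page = [[background] * width for _ in range(height)]
    slot = height // (n_records + 1)  # vertical spacing between records
    for r in range(1, n_records + 1):
        y = r * slot
        length = rng.randint(width // 2, width - 4)  # variable record width
        for x in range(2, 2 + length):
            page[y][x] = ink
    return page, n_records  # the label is the number of records


def add_noise(page, rng, amount=0.02):
    """Replace a random fraction of pixels with a random gray level."""
    return [[rng.randint(0, 255) if rng.random() < amount else px for px in row]
            for row in page]


def rotate(page, degrees, fill=255):
    """Rotate about the page centre using nearest-neighbour sampling."""
    h, w = len(page), len(page[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel for each output pixel.
            sx = c * (x - cx) + s * (y - cy) + cx
            sy = -s * (x - cx) + c * (y - cy) + cy
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = page[iy][ix]
    return out


def augment(page, label, copies, rng, max_angle=3.0):
    """Return noisy, slightly rotated variants; the record count is unchanged."""
    return [(add_noise(rotate(page, rng.uniform(-max_angle, max_angle)), rng), label)
            for _ in range(copies)]


rng = random.Random(0)
page, label = make_page(40, 60, 5, rng)
variants = augment(page, label, copies=4, rng=rng)
```

Each `(image, label)` pair in `variants` could then be fed to a regression or classification network for record counting; a real pipeline would work on scanned backgrounds and rendered text rather than these toy bars.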

Publication details

Source: IAPR International Conference on Document Analysis and Recognition, 2017, pp. 1186-1191 (6 pages)
Authors: Samuele Capobianco; Simone Marinai
Original format: PDF
Keywords: Task analysis; Tools; Unified modeling language; Training; Dictionaries; Text analysis

Similar literature

1. The use of Gabor features for semi-automatically generated polygon-based ground truth of historical document images [J]. Wei Hao, Seuret Mathias, Liwicki Marcus. Literary & Linguistic Computing. 2017, Apr. suppl. 1
2. FRACTURE mining: Mining frequently and concurrently mutating structures from historical XML documents [J]. Ling Chen, Sourav S. Bhowmick, Liang-Tien Chia. Data & Knowledge Engineering. 2006, No. 2
3. Dynamically generating T32 training documents using structured data [J]. Paul James Albert, Ayesha Joshi. Journal of the Medical Library Association. 2019, No. 3
4. DocEmul: A Toolkit to Generate Structured Historical Documents [C]. Samuele Capobianco, Simone Marinai. IAPR International Conference on Document Analysis and Recognition. 2017
5. Generating An Overview Report of Multilevel Structure over A Large Corpus of Documents [D]. Wang, Jingwen. 2019
6. Dynamically generating T32 training documents using structured data [O]. Paul James Albert, Ayesha Joshi. 2019
7. DocEmul: a Toolkit to Generate Structured Historical Documents [O]. Capobianco, Samuele; Marinai, Simone. 2017
