Conference: International Conference on Speech and Computer

Generation of Synthetic Images of Full-Text Documents

Abstract

In this paper, we present an algorithm for generating images of full-text documents. Such images can be used to train and evaluate optical character recognition (OCR) models. The algorithm is modular: individual parts can be swapped or tuned to produce the desired images. We describe a method for obtaining background images of paper from already digitized documents. We train a Variational Autoencoder as a generative model of these backgrounds, which lets us generate, on the fly, background images similar to the training ones. The text-printing module uses large text corpora, a font, and suitable positional and brightness noise to obtain believable results. We run Tesseract OCR on both real-world and generated images and observe that the recognition rates are very similar, which indicates that the synthetic images have a realistic appearance. Furthermore, the mistakes made by the OCR system are alike in both cases. Finally, the system generates a detailed, structured annotation of each synthesized image.
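
The text-printing step and the structured annotation can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `layout_page`, the fixed-width glyph cell, the Gaussian noise model, and every default value are assumptions made for illustration. It skips rendering and background generation entirely and shows only how per-word positional and brightness noise could be recorded alongside the ground-truth text.

```python
import json
import random

# All names and defaults below are illustrative assumptions, not the paper's values.
GLYPH_W, GLYPH_H = 12, 20   # assumed fixed-width glyph cell, in pixels
LINE_GAP = 8                # assumed vertical gap between text lines


def layout_page(text, page_w=600, margin=40, pos_sigma=0.6,
                ink=40, ink_sigma=10, seed=0):
    """Place words on a page with positional and brightness noise.

    Returns a list of per-word annotations: text, bounding box, ink level.
    """
    rng = random.Random(seed)
    x, y = margin, margin
    words = []
    for word in text.split():
        width = GLYPH_W * len(word)
        # Wrap to the next line when the word would cross the right margin.
        if x + width > page_w - margin and x > margin:
            x = margin
            y += GLYPH_H + LINE_GAP
        # Positional noise: small Gaussian jitter around the ideal position.
        jx = x + rng.gauss(0, pos_sigma)
        jy = y + rng.gauss(0, pos_sigma)
        # Brightness noise: per-word ink level, clamped to the 8-bit range.
        level = min(255, max(0, round(rng.gauss(ink, ink_sigma))))
        words.append({
            "text": word,
            "bbox": [round(jx, 2), round(jy, 2), width, GLYPH_H],
            "ink": level,
        })
        x += width + GLYPH_W  # advance by the word plus one space
    return words


if __name__ == "__main__":
    annotation = layout_page("Such images can be used to train OCR models")
    print(json.dumps(annotation[:2], indent=2))
```

Because the annotation is produced at the same time as the layout, every word's ground-truth text and position is known exactly, which is what makes synthetic pages convenient for training and scoring an OCR system. Fixing `seed` keeps a generated page reproducible.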