Conference: IAPR International Conference on Document Analysis and Recognition

Convolutional Neural Networks for Figure Extraction in Historical Technical Documents

Abstract

We present a method for extracting figures and images from the pages of scanned documents, especially technical research articles. Our approach is novel in two key ways. First, we treat this as a computer vision problem and train convolutional neural networks to recognize figures in scanned pages. Second, we generate our training data from 'born-digital' structured documents, which lets us automatically produce labels for our training set using PDF figure extractors. This avoids the otherwise tedious task of hand-labelling thousands of document pages. Our convolutional neural networks achieve precision and recall of close to 85% in identifying figures in a test set of modern journal papers and conference proceedings, and precision and recall above 80% on an application data set of historical technical documents scanned from the Bell Labs Records. Our results show that models trained on digital documents transfer very well to historical scans. Finally, it is easy to extend our models to identify other document elements such as tables and captions.
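As an illustration of the precision and recall figures quoted in the abstract, the following is a minimal sketch of how figure detections on a page might be scored against labelled boxes. The abstract does not say how predictions are matched to labels, so the intersection-over-union (IoU) criterion, the 0.5 threshold, the greedy matching, and the names `iou` and `precision_recall` are all assumptions made for this example, not the authors' evaluation procedure.

```python
from typing import List, Tuple

# A box is (x_min, y_min, x_max, y_max) in page pixel coordinates (assumed format).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predicted: List[Box], labelled: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each predicted box to at most one labelled box.

    A prediction counts as a true positive when its best remaining
    labelled box overlaps it with IoU >= threshold (hypothetical criterion).
    """
    unmatched = list(labelled)
    true_positives = 0
    for box in predicted:
        best = max(unmatched, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)  # each label can be matched only once
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labelled) if labelled else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy boxes invented for the example; not data from the paper.
    predicted = [(0, 0, 10, 10), (50, 50, 60, 60)]
    labelled = [(0, 0, 10, 10), (100, 100, 110, 110)]
    print(precision_recall(predicted, labelled))
```

In this setting the labelled boxes would come from the PDF figure extractor run on born-digital documents, and the predicted boxes from the trained network. Both sources are stand-ins here.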
