Conference: International Conference on Pattern Recognition

Historical document digitization through layout analysis and deep content classification


Abstract

Document layout segmentation and recognition is an important task in the creation of digitized document collections, especially when dealing with historical documents. This paper presents a hybrid approach to layout segmentation as well as a strategy to classify document regions, which is applied to the digitization of a historical encyclopedia. Our layout analysis method merges a classic top-down approach with a bottom-up classification process based on local geometrical features, while regions are classified by means of features extracted from a Convolutional Neural Network merged in a Random Forest classifier. Experiments are conducted on the first volume of the “Enciclopedia Treccani”, a large dataset containing 999 manually annotated pages from the historical Italian encyclopedia.
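
The abstract does not say which top-down method is used. As an illustration only, the sketch below implements a recursive XY-cut, one well-known top-down segmentation technique, on a tiny binary page (1 = ink). The function names `xy_cut` and `_gaps`, the `min_gap` parameter, and the toy page are assumptions made for this example and are not taken from the paper.

```python
def _gaps(profile, min_gap):
    """Return (start, end) runs of zeros in `profile` that are at least min_gap long.

    The caller trims the region first, so every zero run is interior.
    """
    gaps, start = [], None
    for i, value in enumerate(profile):
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            if i - start >= min_gap:
                gaps.append((start, i))
            start = None
    return gaps


def xy_cut(page, box=None, min_gap=2):
    """Split a binary page into blocks as (top, bottom, left, right), ends exclusive."""
    if box is None:
        box = (0, len(page), 0, len(page[0]))
    top, bottom, left, right = box

    # Projection profiles: ink count per row and per column inside the box.
    rows = [sum(page[r][left:right]) for r in range(top, bottom)]
    cols = [sum(page[r][c] for r in range(top, bottom)) for c in range(left, right)]
    if not any(rows):
        return []  # empty region, nothing to report

    # Trim the box to the bounding box of the ink.
    r_idx = [i for i, v in enumerate(rows) if v]
    c_idx = [i for i, v in enumerate(cols) if v]
    t0, b0 = top + r_idx[0], top + r_idx[-1] + 1
    l0, r0 = left + c_idx[0], left + c_idx[-1] + 1
    rows = rows[r_idx[0]:r_idx[-1] + 1]
    cols = cols[c_idx[0]:c_idx[-1] + 1]

    # Candidate cuts: whitespace runs along either axis; pick the widest one.
    candidates = [(e - s, "row", s, e) for s, e in _gaps(rows, min_gap)]
    candidates += [(e - s, "col", s, e) for s, e in _gaps(cols, min_gap)]
    if not candidates:
        return [(t0, b0, l0, r0)]  # no cut possible: this is a leaf block

    _, axis, s, e = max(candidates)
    if axis == "row":
        first, second = (t0, t0 + s, l0, r0), (t0 + e, b0, l0, r0)
    else:
        first, second = (t0, b0, l0, l0 + s), (t0, b0, l0 + e, r0)
    return xy_cut(page, first, min_gap) + xy_cut(page, second, min_gap)


# Toy page: a full-width heading above two text columns.
page = [[0] * 12 for _ in range(10)]
for r in (0, 1):
    page[r] = [1] * 12
for r in range(4, 10):
    page[r] = [1] * 5 + [0, 0] + [1] * 5

blocks = xy_cut(page)
print(blocks)
```

In a hybrid pipeline of the kind the abstract describes, blocks like these would then be refined by a bottom-up step and passed to a region classifier; neither of those stages is shown here.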
