Journal: International Journal on Document Analysis and Recognition

Extracting text from scanned Arabic books: a large-scale benchmark dataset and a fine-tuned Faster-R-CNN model

Abstract

Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books. Among these, 1500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a baseline model for page layout segmentation and text extraction, based on a fine-tuned Faster R-CNN structure (FFRA). This baseline model yields cross-validation results with an average accuracy of 99.4% and an F1 score of 99.1% for text versus non-text block classification on the 1500 annotated images of BE-Arabic-9K. These results are remarkably better than those of the state-of-the-art Arabic book page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it an outstanding baseline model to challenge.
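
As a reading aid for the accuracy and F1 figures quoted above, the following is a minimal sketch of how these two metrics are conventionally computed for a binary text versus non-text classification. The abstract does not say whether the authors score per block, per region, or per pixel, nor how they average across cross-validation folds, so the function name `binary_metrics` and the counts below are purely illustrative assumptions and are not taken from the paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts.

    "Positive" is assumed to mean "text block"; this convention is an
    assumption made for illustration, not something the abstract states.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts, chosen only to exercise the formulas.
acc, prec, rec, f1 = binary_metrics(tp=90, fp=10, fn=5, tn=95)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Under a cross-validation protocol, one common choice is to compute these values per fold and report their mean; whether the paper does exactly this is not stated in the abstract.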