Extracting text from scanned Arabic books: a large-scale benchmark dataset and a fine-tuned Faster-R-CNN model

Elanwar Randa; Qin Wenda; Betke Margrit; Wijaya Derry

Abstract

Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books. Among these, 1500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a baseline model for page layout segmentation and text extraction based on a fine-tuned Faster R-CNN structure (FFRA). This baseline model yields cross-validation results with an average accuracy of 99.4% and an F1 score of 99.1% for text versus non-text block classification on the 1500 annotated images of BE-Arabic-9K. These results are remarkably better than those of the state-of-the-art Arabic book page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it an outstanding baseline model to challenge.
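To make the reported metrics concrete, the following is a minimal sketch of how accuracy and F1 score for text versus non-text block classification could be averaged over cross-validation folds. It is not the authors' evaluation code: the `BlockCounts` type, the helper names, and the fold counts are all hypothetical and chosen only for illustration, with "text" treated as the positive class.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class BlockCounts:
    """Hypothetical confusion counts for one fold; 'text' is the positive class."""
    tp: int  # text blocks predicted as text
    fp: int  # non-text blocks predicted as text
    fn: int  # text blocks predicted as non-text
    tn: int  # non-text blocks predicted as non-text


def accuracy(c: BlockCounts) -> float:
    """Fraction of all blocks that were classified correctly."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def f1_score(c: BlockCounts) -> float:
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def mean_over_folds(folds: Sequence[BlockCounts],
                    metric: Callable[[BlockCounts], float]) -> float:
    """Average a per-fold metric across cross-validation folds."""
    if not folds:
        raise ValueError("at least one fold is required")
    return sum(metric(c) for c in folds) / len(folds)


# Made-up counts for two folds (illustrative only, not results from the paper).
folds = [
    BlockCounts(tp=90, fp=5, fn=5, tn=100),
    BlockCounts(tp=80, fp=10, fn=10, tn=100),
]
mean_accuracy = mean_over_folds(folds, accuracy)
mean_f1 = mean_over_folds(folds, f1_score)
```

Averaging per-fold scores is one common convention; pooling the counts across folds before computing the metrics is another, and the abstract does not say which one was used.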

Bibliographic details

Source
International Journal on Document Analysis and Recognition | 2021, Issue 4 | pp. 349-362 | 14 pages
Authors
Elanwar Randa; Qin Wenda; Betke Margrit; Wijaya Derry
Author affiliations

Elect Res Inst Comp & Syst Dept Cairo Egypt;

Boston Univ Boston MA 02215 USA;

Boston Univ Boston MA 02215 USA;

Boston Univ Dept Comp Sci 111 Cummington St Boston MA 02215 USA;

Original format: PDF
Language: English

Similar literature

1. Open Datasets and Tools for Arabic Text Detection and Recognition in News Video Frames [J]. Oussama Zayene, Sameh Masmoudi Touj, Jean Hennebert. Journal of Imaging. 2018, Issue 2.
2. AUTNT - A component level dataset for text non-text classification and benchmarking with novel script invariant feature descriptors and D-CNN [J]. Khan Tauseef, Mollah Ayatullah Faruk. Multimedia Tools and Applications. 2019, Issue 22.
3. A Novel Coding and Discrimination (CODIS) Algorithm to Extract Features from Arabic Texts to Discriminate Arabic Poems [J]. Nada Ahmed J., Abdul Monem S. Rahma, Maha A. Hmmood Alrawi. International Journal of Advanced Pervasive and Ubiquitous Computing. 2019, Issue 1.
4. Adoptive Thresholding and Geometric Features based Physical Layout Analysis of Scanned Arabic Books [C]. Maitham A Al-Dobais, Fahad Abdulrahman G Alrasheed, Ghazanfar Latif. 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition. 2018.
5. Modeling economic and financial behavior from large-scale datasets [D]. Mao, Huina. 2014.
6. SANAD: Single-label Arabic News Articles Dataset for automatic text categorization [O]. Omar Einea, Ashraf Elnagar, Ridhwan Al Debsi. 2019.
7. Arabic Database for Automatic Printed Arabic Text Recognition Research and Benchmarking [O]. Al-Hashim Amin Ghalib S. 2009.
