Printed Text Image Database for Sindhi OCR

Abstract

Document Image Understanding (DIU) and Electronic Document Management are active fields of research involving image understanding, interpretation, efficient handling, and routing of documents, as well as their retrieval. Research on most noncursive scripts (such as Latin) has matured, whereas research on cursive (connected) scripts is still maturing. Many researchers around the world are currently working on cursive scripts (Arabic and other scripts that adopt it) to overcome the difficulties and challenges in understanding and handling documents in these scripts. Among languages adopting the Arabic script, Sindhi has the largest extension of the original Arabic alphabet: it contains 52 characters, compared to 28 in the original Arabic alphabet, in order to accommodate more sounds of the language. There are 24 differentiating characters, some of which possess four dots. Sindhi OCR research and development needs a database of Sindhi text images for training and testing. We have developed a large database containing over 4 billion words and 15 billion characters in 150 different fonts, in four font weights and four styles. The database contents were collected from various sources, including websites, books, and theses. A custom-built application that supports various fonts and sizes was also developed to create a text image from a text document. The database considers words, characters, characters with spaces, and lines. The database is freely available, in part or in full, on request by email to one of the authors.
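The four units the abstract names (words, characters, characters with spaces, and lines) can be illustrated with a minimal Python sketch. This is not the authors' tool: the function name `count_units` and the counting rules (whitespace-delimited words, one character per Unicode code point, blank lines skipped) are assumptions made for illustration only.

```python
def count_units(text: str) -> dict:
    """Count lines, words, and characters in a block of text.

    Assumed rules (not taken from the paper):
    - a line is any non-blank line of the input;
    - a word is a whitespace-delimited token;
    - a character is one Unicode code point, so each Sindhi
      letter counts once regardless of its contextual shape.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    words = [w for ln in lines for w in ln.split()]
    # Characters with spaces: every code point on the kept lines,
    # excluding the line terminators themselves.
    chars_with_spaces = sum(len(ln) for ln in lines)
    # Characters: the same, with whitespace left out.
    chars = sum(1 for ln in lines for ch in ln if not ch.isspace())
    return {
        "lines": len(lines),
        "words": len(words),
        "characters": chars,
        "characters_with_spaces": chars_with_spaces,
    }


if __name__ == "__main__":
    # Hypothetical two-line sample in Sindhi script.
    sample = "سنڌي ٻولي\nسنڌي\n"
    for unit, count in count_units(sample).items():
        print(f"{unit}: {count}")
```

Counting by code point is the simplest choice here; a corpus that uses combining diacritics might instead need grapheme-level counting, which this sketch does not attempt.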
