Conference: International Conference on Document Analysis and Recognition

Towards Searchable Digital Urdu Libraries - A Word Spotting Based Retrieval Approach

Abstract

Libraries in South Asia hold large collections of valuable printed documents in Urdu, and digitizing these collections would make them more accessible. The lack of an OCR system for Urdu, however, limits a digital Urdu library to scanned document images, with only a very limited search facility based on manually assigned tags. We address this issue by proposing a word-spotting-based keyword search method for information retrieval in digitized collections of printed Urdu documents. The proposed method segments Urdu text into partial words and represents each partial word by a set of features. To search for a specific word (or phrase), the user provides a query in the form of an image. The features of the partial words in the query image are compared with those already indexed, and the user is given a list of documents containing occurrences of the queried word. Evaluated on 50 Urdu documents, the system achieved a recall of 95.17% and a precision of 94.3%.
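
The pipeline described in the abstract (segment text into partial words, describe each by features, index the documents, then match a query image against the index) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: partial words are approximated by 8-connected components of ink pixels in a binarized image, the feature set is just aspect ratio and ink density, and a document counts as a hit when a run of consecutive partial words matches the query within a tolerance `tol`. The function names (`partial_words`, `features`, `describe`, `search`) are hypothetical, and the sketch ignores dots and diacritics, which a real Urdu system would have to associate with their main bodies.

```python
from collections import deque


def partial_words(img):
    """Return the 8-connected ink components of a binary image (rows of 0/1),
    ordered right to left because Urdu is written right to left.
    Assumption: one component stands in for one partial word."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:  # breadth-first flood fill of one component
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                comps.append(pixels)
    comps.sort(key=lambda p: -max(x for _, x in p))  # rightmost first
    return comps


def features(pixels):
    """Illustrative feature vector: (aspect ratio, ink density of bounding box)."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return (width / height, len(pixels) / (width * height))


def describe(img):
    """Feature vectors of all partial words in an image, in reading order."""
    return [features(p) for p in partial_words(img)]


def search(index, query_img, tol=0.15):
    """Return ids of indexed documents containing a run of consecutive partial
    words whose features all lie within `tol` of the query's partial words."""
    query = describe(query_img)
    if not query:
        return []
    hits = []
    for doc_id, feats in index.items():
        for start in range(len(feats) - len(query) + 1):
            window = feats[start:start + len(query)]
            if all(abs(a - b) <= tol
                   for f, q in zip(window, query)
                   for a, b in zip(f, q)):
                hits.append(doc_id)
                break
    return hits
```

In this sketch the index is a plain dictionary built once, for example `index = {doc_id: describe(image)}`, and `search(index, query_image)` returns the ids of the matching documents. A realistic system would use richer shape features and a more robust distance than a fixed per-feature tolerance.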