Enabling Search over Large Collections of Telugu Document Images - An Automatic Annotation Based Approach

Abstract

For the first time, search is enabled over a massive collection of 21 million word images from digitized document images. This work advances the state of the art on multiple fronts: ⅰ) Indian language document images are made searchable by textual queries, ⅱ) interactive content-level access is provided to document images for search and retrieval, ⅲ) a novel recognition-free approach, which does not require OCR, is adapted and validated, ⅳ) a suite of image processing and pattern classification algorithms is proposed to efficiently automate the process, and ⅴ) the scalability of the solution is demonstrated over a large collection of 500 digitized books consisting of 75,000 pages.

Character recognition based approaches yield poor results for developing search engines for Indian language document images, due to the complexity of the script and the poor quality of the documents. Recognition-free approaches, based on word-spotting, are not directly scalable to large collections, due to the computational complexity of matching images in the feature space. For example, if it requires 1 ms to match two images, the retrieval of documents for a single query, from a large collection like ours, would require close to a day's time. In this paper we propose a novel automatic annotation based approach to provide textual description of document images. With a one-time, offline computational effort, we are able to build a text-based retrieval system over annotated images. This system has an interactive response time of about 0.01 second. However, we pay the price in the form of massive offline computation, which is performed on a cluster of 35 computers for about a month. Our procedure is highly automatic, requiring minimal human intervention.
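
The trade-off described in the abstract (a costly offline annotation pass, followed by fast text lookup at query time) can be illustrated with a minimal sketch. It assumes each word image has already been assigned a text label offline; the image identifiers, the labels, and the function names `build_index` and `search` are hypothetical and are not taken from the paper, whose actual annotation and indexing pipeline is not described in this record.

```python
from collections import defaultdict


def build_index(annotations):
    """Build an inverted index from (image_id, label) pairs.

    This stands in for the one-time offline step: every word image
    is assumed to already carry a text annotation.
    """
    index = defaultdict(list)
    for image_id, label in annotations:
        index[label].append(image_id)
    return dict(index)


def search(index, query):
    """Answer a textual query with a dictionary lookup.

    No image-to-image matching happens at query time, which is why
    the response can be interactive regardless of collection size.
    """
    return index.get(query, [])


# Hypothetical annotations: (word-image id, text label assigned offline).
annotations = [
    ("book001/page0003/word017", "andhra"),
    ("book001/page0003/word018", "pradesh"),
    ("book002/page0120/word004", "andhra"),
]

index = build_index(annotations)
print(search(index, "andhra"))
```

In this sketch a query costs one hash lookup, whereas a word-spotting search would have to compare the query image against every word image in the collection.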

Bibliographic details

Source
Computer Vision, Graphics and Image Processing; Lecture Notes in Computer Science; 4338 | 2006 | pp. 837-848 | 12 pages
Conference location: Madurai (IN)
Authors
Pramod Sankar K.; C.V. Jawahar
Author affiliation
Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad, India
Original format: PDF
Language: eng
CLC classification: Information processing

Similar literature

1. Intelligent Search and Automatic Document Classification and Cataloging Based on Ontology Approach [J]. Vyacheslav Lanin, Lyudmila Lyadova. International Journal Information Theories and Applications. 2007, No. 1
2. A Semi-supervised Learning Approach Based on Adaptive Weighted Fusion for Automatic Image Annotation [J]. Li Zhixin, Lin Lan, Zhang Canlong. ACM transactions on multimedia computing communications and applications. 2021, No. 1
3. Multi-label automatic image annotation approach based on multiple improvement strategies [J]. Jin Cong, Jin Shu-Wei. Image Processing, IET. 2019, No. 4
4. Reverse Annotation Based Retrieval from Large Document Image Collections [C]. Pramod Sankar K. 33rd annual international ACM SIGIR conference on research and development in information retrieval 2010. 2010
5. Information extraction to enable faceted search over large text document collections. [D]. Ahmed, Syed Toufeeq. 2010
6. Automatic medical image annotation and keyword-based image retrieval using relevance feedback [O]. Byoung Chul Ko, JiHyeon Lee, Jae-Yeal Nam. 2012
7. Enabling Search over Large Collections of Telugu Document Images – An Automatic Annotation Based Approach [O]. Pramod Sankar K, C. V. Jawahar. 2013
