Language Independent Word Spotting in Scanned Documents

Abstract

Large quantities of scanned handwritten and printed documents are rapidly being made available to information storage and retrieval systems, for example those used by libraries. We present the design and performance of a language-independent system for spotting handwritten or printed words in scanned document images. The technique is evaluated with three scripts: Devanagari (Sanskrit/Hindi), Arabic (Arabic/Urdu) and Latin (English). The three main components of the system are a word segmenter, a shape-based matcher for words, and a search interface. The user gives a query, which can be (i) a word image (to spot similar words in a collection of documents written in that script) or (ii) text (to look for the equivalent word images in the script). Candidate words found in the documents are retrieved and ranked, where the ranking criterion is a similarity score between the query and each candidate word based on global word shape features. For handwritten English, a precision of 60% was obtained at a recall of 50%. An alternative approach comprising prototype selection and word matching, which yields better performance for handwritten documents, is also discussed. For printed Sanskrit documents, a precision as high as 90% was obtained at a recall of 50%.
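
The retrieve-and-rank step and the "precision at a given recall" figures quoted in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the shape features or the similarity measure, so the toy 4-element feature vectors, the use of cosine similarity, and the names cosine_similarity, rank_candidates and precision_at_recall are hypothetical and are not the authors' implementation.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors.

    Assumption: cosine similarity stands in for the paper's (unspecified
    here) word-shape similarity score.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return candidate ids sorted by decreasing similarity to the query.

    `candidates` maps a hypothetical word-image id to its feature vector.
    """
    scored = [(cosine_similarity(query, feats), cid)
              for cid, feats in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored]


def precision_at_recall(ranked_ids, relevant_ids, target_recall):
    """Precision at the first rank where recall reaches `target_recall`."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    hits = 0
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant:
            hits += 1
            if hits / len(relevant) >= target_recall:
                return hits / rank
    return 0.0  # the target recall is never reached in this ranking


# Toy data: made-up global shape features for a query and four candidates.
query = [1.0, 0.0, 1.0, 1.0]
candidates = {
    "w1": [1.0, 0.0, 1.0, 1.0],
    "w2": [0.0, 1.0, 0.0, 0.0],
    "w3": [1.0, 0.0, 1.0, 0.0],
    "w4": [1.0, 1.0, 0.0, 0.0],
}
ranked = rank_candidates(query, candidates)
print(ranked)
print(precision_at_recall(ranked, {"w1", "w4"}, 0.5))
```

Read this way, a result such as "60% precision at 50% recall" means that once the ranked list has returned half of the true occurrences of the query word, 60% of the items returned up to that point are correct.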

Bibliographic record

Source: Digital Libraries: Universal and Ubiquitous Access to Information, 2008, pp. 134-143 (10 pages)
Conference location: Bali (ID)
Authors: Sargur N. Srihari; Gregory R. Ball
Author affiliation: Center of Excellence for Document Analysis and Recognition (CEDAR), Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, New York 14228, USA
Original format: PDF
Language: English
Chinese Library Classification: Computer networks
