Content-based handwritten document indexing and retrieval.


Abstract

Information retrieval on textual data has been well studied, and its applications (such as web searching) have become ubiquitous in our daily lives. However, content-based image retrieval on handwritten document collections remains a challenging problem. Here "content-based" means that the search analyzes the actual content of the images rather than merely the metadata. In the context of handwritten documents, the word "content" may refer to different things, such as writing style, the shape of words and characters, or the truth of the writing. Accordingly, two different types of retrieval can be performed: "query by example" and semantic (or "query by text") retrieval. While both have their own real-world applications, the second is more intuitive and user-friendly, since it uses not only the low-level computational features but also an understanding of the documents.

This work explores several automatic techniques for performing both types of retrieval on handwritten document collections. The techniques fall into three groups: (i) indexing, (ii) "query by example" retrieval and (iii) "query by text" retrieval.

For indexing, we focus on the problems of word segmentation and transcript mapping. Word segmentation is the task of segmenting text line images into word images, one of the most important preprocessing steps for any word-level analysis or recognition. We propose using a neural network with a new set of global and local features to classify gaps as inter-word or intra-word. The transcript mapping problem is an alignment problem between the handwritten document image and its transcript. The task is nontrivial because the word segmentation algorithm is error-prone. A recognition-based dynamic programming algorithm is proposed to solve this problem. It is also shown to improve the accuracy of automatic word segmentation.

In "query by example" retrieval, the query can be either a full-page document or a single word image. For document-level retrieval, a statistical model is learned to determine whether the writing styles of two documents are similar. Gamma and Gaussian distributions are used for the modeling. Word-level retrieval is performed by a feature-based similarity search algorithm. For each word image, a 1024-bit binary feature vector is extracted for this purpose. "Query by text" retrieval is a more challenging task because word-level segmentation is error-prone and word recognition with a large lexicon is still an unsolved problem. The current solution is to annotate the collection manually, which is costly. Borrowing an idea from machine translation in textual information retrieval, we propose a statistical approach to word recognition and use the probabilistic annotation results to perform language model retrieval on handwritten documents. The performance of all these approaches is compared empirically on several test collections.

The main contributions of this work are a detailed examination of different levels of content-based image retrieval for handwritten documents, and the development of a retrieval system that accepts either image or text queries. The new word segmentation method shows improved performance over a previous method and is useful in forensic document analysis. In addition, a large handwriting database of 3824 pages (about 573,600 labeled words) was created using the proposed transcript-mapping algorithm. This database was used predominantly in this dissertation, and it serves as a useful resource for future handwriting analysis and recognition research.
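The word-level "query by example" step mentioned in the abstract (similarity search over 1024-bit binary feature vectors) can be illustrated with a minimal sketch. The abstract does not say which similarity measure or feature extractor the dissertation uses, so the Hamming-based score below is an assumption, and the names `hamming_similarity` and `rank_by_similarity` are hypothetical. Feature vectors are represented as Python integers whose lowest 1024 bits hold the features.

```python
from typing import List, Tuple

N_BITS = 1024  # feature length stated in the abstract


def hamming_similarity(a: int, b: int, n_bits: int = N_BITS) -> float:
    """Fraction of bit positions on which two binary feature vectors agree.

    Assumption: the abstract names no measure; Hamming agreement is used
    here only as a simple stand-in.
    """
    differing = bin(a ^ b).count("1")  # XOR marks the disagreeing bits
    return 1.0 - differing / n_bits


def rank_by_similarity(
    query: int, index: List[Tuple[str, int]], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Return the top_k (word_id, score) pairs, most similar first."""
    scored = [(word_id, hamming_similarity(query, feat)) for word_id, feat in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

A query word image would first be converted to its feature integer by a feature extractor (not shown, and not described in the abstract), then passed to `rank_by_similarity` together with the indexed collection.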


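Bibliographic record

Author: Huang, Chen
Author affiliation: State University of New York at Buffalo, Computer Science and Engineering
Degree-granting institution: State University of New York at Buffalo, Computer Science and Engineering
Subject: Computer Science
Degree: Ph.D.
Year: 2008
Pages: 121 p.
Total pages: 121
Original format: PDF
Language of text: English
Chinese Library Classification: Automation technology, computer technology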

