Content-based handwritten document indexing and retrieval.


Abstract

Information retrieval on textual data has been well studied, and its applications (such as web searching) have become ubiquitous in our daily lives. However, content-based image retrieval on handwritten document collections remains a challenging problem. Here "content-based" means that the search analyzes the actual content of the images rather than merely the metadata. In the context of handwritten documents, the word "content" might refer to different things, such as writing style, the shape of words and characters, or the truth of the writing. Accordingly, two different types of retrieval can be performed: "query by example" retrieval and semantic (or "query by text") retrieval. While both have their own real-world applications, the second is more intuitive and user-friendly, since it uses not only the low-level computational features but also an understanding of the documents.

This work explores several automatic techniques for performing both types of retrieval on handwritten document collections. The techniques fall into three groups: (i) indexing, (ii) "query by example" retrieval, and (iii) "query by text" retrieval.

For indexing, we focus on the problems of word segmentation and transcript mapping. Word segmentation is the task of segmenting text line images into word images, and it is one of the most important preprocessing steps for any word-level analysis or recognition. We propose using a neural network with a new set of global and local features to classify gaps as inter-word or intra-word. The transcript mapping problem is an alignment problem between a handwritten document image and its transcript. It is not a trivial task, because the word segmentation algorithm is error-prone. A recognition-based dynamic programming algorithm is proposed to solve this problem. It is also shown to improve the accuracy of automatic word segmentation.

In "query by example" retrieval, the query can be either a full-page document or a single word image. For document-level retrieval, a statistical model is learned to determine whether the writing styles of two documents are similar. Gamma and Gaussian distributions are used for the modeling. Word-level retrieval is performed by a feature-based similarity search algorithm. For each word image, a 1024-bit binary feature vector is extracted for this purpose. "Query by text" retrieval is a more challenging task, because word-level segmentation is error-prone and word recognition with a large lexicon is still an unsolved problem. The current solution is to annotate the collection manually, which is costly. Borrowing the idea of machine translation from textual information retrieval, we propose a statistical approach to word recognition and use the probabilistic annotation results to perform language model retrieval on handwritten documents. The performance of all these approaches is compared empirically on several test collections.

The main contributions of this work are a detailed examination of different levels of content-based image retrieval for handwritten documents, and the development of a retrieval system that accepts either image or text queries. The new word segmentation method performs better than a previous method and is useful in forensic document analysis. In addition, a large handwriting database of 3824 pages (about 573,600 labeled words) was created using the proposed transcript-mapping algorithm. This database was used predominantly in this dissertation, and it serves as a useful resource for future handwriting analysis and recognition research.
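The word-level "query by example" step can be illustrated with a minimal sketch. The abstract states only that a 1024-bit binary feature vector is extracted per word image and that retrieval uses a feature-based similarity search; the similarity measure is not named. The sketch below assumes a Hamming-based similarity (fraction of agreeing bits), and the names `hamming_similarity`, `rank_by_similarity`, `flip_bits`, and the toy index are hypothetical, not taken from the dissertation.

```python
import random

N_BITS = 1024  # feature length stated in the abstract


def hamming_similarity(a, b, n_bits=N_BITS):
    """Fraction of bit positions on which two n_bits-wide vectors agree.

    Assumption: the dissertation's actual similarity measure is not given
    in the abstract; Hamming agreement is used here only as a stand-in.
    """
    differing = bin(a ^ b).count("1")
    return 1.0 - differing / n_bits


def rank_by_similarity(query, index, top_k=3):
    """Return the top_k (word_id, score) pairs, most similar first."""
    scored = [(hamming_similarity(query, vec), word_id)
              for word_id, vec in index.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(word_id, score) for score, word_id in scored[:top_k]]


def flip_bits(vec, positions):
    """Toy helper: flip the given bit positions of an integer bit vector."""
    for p in positions:
        vec ^= 1 << p
    return vec


# Synthetic data: each word image is represented by a Python int
# holding its 1024 feature bits.
rng = random.Random(0)
base = rng.getrandbits(N_BITS)
index = {
    "word_a": flip_bits(base, range(10)),    # differs from query in 10 bits
    "word_b": flip_bits(base, range(200)),   # differs in 200 bits
    "word_c": base ^ ((1 << N_BITS) - 1),    # differs in every bit
}
results = rank_by_similarity(base, index)
```

Storing each vector as a single integer keeps the comparison to one XOR and a bit count, which is why binary features are attractive for scanning a large collection of word images linearly.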

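Bibliographic record

  • Author

    Huang, Chen

  • Author affiliation

    State University of New York at Buffalo, Computer Science and Engineering

  • Degree-granting institution: State University of New York at Buffalo, Computer Science and Engineering
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2008
  • Pages: 121 p.
  • Total pages: 121
  • Original format: PDF
  • Language: English (eng)
  • Chinese Library Classification: Automation technology, computer technology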