
Metadata Analysis in Unstructured Documents Using Classical and Deep Learning Methods

Abstract

Metadata is, by definition, any set of data that describes and provides information about other data. Document metadata, specifically, is any information that better represents a document or improves understanding of it. The most commonly available document metadata, such as title, author, and edit time, are generated automatically when the file is created. Content-based metadata, such as information carried by graphics or author-specific characteristics, also exists but is often overlooked. In this thesis, we focus on approaches to extracting and understanding such implicit content-based document metadata in machine-printed and handwritten documents. The two key contributions of this thesis concern (a) graphics metadata, where we offer new approaches to extracting and understanding information graphics, and (b) handwritten text metadata, where we seek to capture author-specific feature representations.

The vast number of publicly available scanned handwritten document collections are unstructured, but current approaches such as optical character recognition (OCR) assume that the document under consideration has a uniform structure. They therefore struggle to handle mixed text and non-text data in non-uniform documents; for example, given a line plot with text content, an OCR system would overlook the text data. This calls for an automated technique to process and digitize these types of documents.

In the first part of the thesis, we study a class of deep learning architectures that help us segment the different parts of a document image. Specifically, to facilitate segmentation, we discuss a novel approach that uses convolutional neural networks (CNNs) to learn a feature representation for different types of data, such as machine-printed text, handwritten text, and graphics.

The second part of the thesis addresses obtaining information from non-text data, which opens an unexplored avenue of metadata information and thus advances existing techniques for understanding text data. We discuss novel methods to extract text and non-text data from information graphics, such as line plots and phase diagrams, and to infer a representational message using Bayesian networks.

Finally, we discuss a neural network model that performs adaptive handwriting recognition and works with limited labeled data. Long short-term memory (LSTM) networks, a class of recurrent neural networks, have been used in handwriting recognition for years. We postulated that each author follows a unique writing style, both in handwriting and in sentence formulation, and therefore developed an adaptive LSTM-based handwriting recognition model. We exploit these user-specific features by adapting neural networks to better recognize handwritten text.

In summary, this thesis discusses an end-to-end system for converting a collection of documents into a digital archive that can be indexed and searched. We implement a CNN-based network to spot the different sections in individual pages of the collection. On the identified text sections, we propose an LSTM and a neural-network-based language model for recognition and transcription. Finally, we discuss some approaches to handling non-text data; because understanding graphics requires well-defined goals, we focus specifically on information graphics such as line plots and phase diagrams.

Bibliographic record

  • Author

    Nair, Rathin Radhakrishnan

  • Author affiliation

    State University of New York at Buffalo

  • Degree-granting institution State University of New York at Buffalo
  • Subject Computer science; Artificial intelligence
  • Degree Ph.D.
  • Year 2017
  • Pages 147
  • Original format PDF
  • Language English
