Information Sciences: An International Journal

Multi-lingual date field extraction for automatic document retrieval by machine


Abstract

Robotic intelligence has recently received significant attention in the research community. Such artificial intelligence can be applied to automatic document retrieval and interpretation, in which a robot responds to a query. To produce the desired feedback, the key information must be extracted from the document based on the query. For this purpose, we propose a system for automatic date field extraction from multi-lingual (English, Devnagari and Bangla scripts) handwritten documents. The date is a key piece of information that can be used in various robotic applications such as date-wise document indexing and retrieval. The system first identifies the script of the document; based on the identified script, the word components of each text line are classified into month and non-month classes using word-level feature extraction and classification. Next, non-month words are segmented into individual components, each labelled as text, digit, punctuation or contraction. The date patterns are then searched for using the labelled components. Both numeric and semi-numeric regular expressions are used for date part extraction. Dynamic Time Warping (DTW) and profile feature-based approaches are used to classify month/non-month words. Other date components, such as numerals and punctuation marks, are recognised using a gradient-based feature and a Support Vector Machine (SVM) classifier. Experiments on English, Devnagari and Bangla document datasets gave encouraging results, which indicate the effectiveness of the proposed system. (C) 2014 Elsevier Inc. All rights reserved.
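
The pattern-search step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual regular expressions, so everything below is an assumption made for illustration: the one-letter label codes (M, T, D, P, C), the two patterns, and the function name `find_date_spans` are hypothetical and are not the authors' implementation. The sketch encodes a text line as a string of component labels and searches that string for numeric dates (such as 12/05/2014) and semi-numeric dates (such as "12th May, 2014" or "May 12, 2014").

```python
import re

# Hypothetical one-letter codes for the component classes named in the abstract:
# M = month word, T = other text, D = digit, P = punctuation, C = contraction ("st", "th").
# The patterns below are illustrative assumptions, not the expressions used in the paper.
NUMERIC = r"D{1,2}PD{1,2}PD{2,4}"                       # e.g. 12/05/2014
SEMI_NUMERIC = r"D{1,2}C?P?MP?D{2,4}|MD{1,2}C?P?D{2,4}"  # e.g. 12th May, 2014 / May 12, 2014

# The lookarounds stop a match from starting or ending inside a longer run of digits.
DATE_PATTERN = re.compile(f"(?<!D)(?:{NUMERIC}|{SEMI_NUMERIC})(?!D)")


def find_date_spans(labels):
    """Return (start, end) index pairs of label runs that look like a date.

    `labels` is a sequence of one-letter component labels for a text line,
    in reading order. The end index is exclusive.
    """
    encoded = "".join(labels)
    return [match.span() for match in DATE_PATTERN.finditer(encoded)]


if __name__ == "__main__":
    # "Dated 12th May, 2014" -> one text word, then 1, 2, "th", month, ",", 2, 0, 1, 4
    line = ["T", "D", "D", "C", "M", "P", "D", "D", "D", "D"]
    print(find_date_spans(line))  # [(1, 10)]
```

Matching on the label string rather than on recognised characters means the search needs only the class of each component, which is what the classification stages in the abstract produce.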
