Multi-lingual date field extraction for automatic document retrieval by machine

Abstract

Robotic intelligence has recently received significant attention in the research community. One application is automatic document retrieval and interpretation, in which a robot responds to a query. To produce the desired response, the key information must be extracted from the document based on the query. To this end, this paper proposes a system for automatic date field extraction from multi-lingual handwritten documents in English, Devnagari and Bangla scripts. The date is a key piece of information that can be used in various robotic applications, such as date-wise document indexing and retrieval. The system first identifies the script of the document. Based on the identified script, the word components of each text line are classified into month and non-month classes using word-level feature extraction and classification. Next, non-month words are segmented into individual components, and each component is labelled as text, digit, punctuation or contraction. The date patterns are then searched for in the sequence of labelled components, using both numeric and semi-numeric regular expressions for date part extraction. Month and non-month words are classified using Dynamic Time Warping (DTW) and profile feature-based approaches. Other date components, such as numerals and punctuation marks, are recognised using a gradient-based feature and a Support Vector Machine (SVM) classifier. Experiments were performed on English, Devnagari and Bangla document datasets, and the encouraging results indicate the effectiveness of the proposed system.
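The pattern-search step described in the abstract can be illustrated with a minimal sketch. It assumes that each component of a text line has already been labelled, and it encodes the label sequence as a string so that ordinary regular expressions can be run over it. The one-letter codes, the two patterns and the function name `find_date_spans` are hypothetical choices made for this illustration; the abstract does not give the expressions the authors actually used.

```python
import re

# Hypothetical one-letter codes for the component labels named in the abstract.
CODES = {"month": "M", "digit": "D", "punct": "P", "contraction": "C", "text": "T"}

# Assumed numeric pattern, e.g. 12/05/2014: digits, punctuation, digits, punctuation, digits.
NUMERIC = re.compile(r"D{1,2}PD{1,2}PD{2,4}")

# Assumed semi-numeric patterns, e.g. "5th March 2014" or "March 5, 2014".
SEMI_NUMERIC = re.compile(r"D{1,2}C?MP?D{2,4}|MD{1,2}C?P?D{2,4}")


def find_date_spans(labels):
    """Return (kind, start, end) index spans of date-like runs in a label sequence."""
    # One character per component, so regex offsets equal component indices.
    encoded = "".join(CODES[label] for label in labels)
    spans = []
    for kind, pattern in (("numeric", NUMERIC), ("semi-numeric", SEMI_NUMERIC)):
        for match in pattern.finditer(encoded):
            spans.append((kind, match.start(), match.end()))
    return sorted(spans, key=lambda span: span[1])


# Labels for a line such as "Dated 12/05/2014" (one text word, then the date components).
line = ["text"] + ["digit"] * 2 + ["punct"] + ["digit"] * 2 + ["punct"] + ["digit"] * 4
print(find_date_spans(line))
```

Because each component maps to exactly one character, the returned offsets index directly into the original component list, so the matched components can be passed on to a recogniser. The sketch does not anchor patterns at digit-run boundaries, so a match may begin inside a longer run of digits. A real system would probably need stricter patterns.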
