Multi-lingual date field extraction for automatic document retrieval by machine

Mandal Ranju; Roy Partha Pratim; Pal Umapada; Blumenstein Michael

Abstract

Robotic intelligence has recently received significant attention in the research community. Such artificial intelligence can be applied to let a robot retrieve and interpret documents automatically in response to a query. To produce the desired feedback, the key information must be extracted from the document based on the query. For this purpose, we propose a system for automatic date field extraction from multi-lingual (English, Devnagari and Bangla scripts) handwritten documents. The date is a key piece of information that can be used in various robotic applications such as date-wise document indexing and retrieval. In the proposed system, the script of the document is identified first. Based on the identified script, the word components of each text line are classified into month and non-month classes using word-level feature extraction and classification. Next, non-month words are segmented into individual components, each labelled as text, digit, punctuation or contraction. The date patterns are then searched using the labelled components, with both numeric and semi-numeric regular expressions used for date part extraction. Dynamic Time Warping (DTW) and profile feature-based approaches are used to classify month/non-month words. Other date components, such as numerals and punctuation marks, are recognised using a gradient-based feature and a Support Vector Machine (SVM) classifier. Experiments were performed on English, Devnagari and Bangla document datasets, and the encouraging results indicate the effectiveness of the proposed system. (C) 2014 Elsevier Inc. All rights reserved.
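The pattern-search step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual regular expressions, so everything below is an assumption made for illustration: the one-letter label codes, the two patterns, and the function name `find_date_spans` are hypothetical. The sketch assumes each component of a text line has already been labelled (month, digit, punctuation, contraction or text) by upstream classifiers, encodes the labels as a string, and searches that string with numeric and semi-numeric patterns.

```python
import re

# Hypothetical one-letter codes for the component labels named in the abstract.
LABEL_CODES = {
    "month": "M",        # a whole word classified as a month name
    "digit": "D",        # a single numeral
    "punct": "P",        # a punctuation mark such as "/", "-", "," or "."
    "contraction": "C",  # an ordinal suffix such as "st" or "th"
    "text": "T",         # any other text component
}

# Assumed numeric pattern: day, separator, month, separator, year (e.g. 12/05/2014).
NUMERIC = r"D{1,2}PD{1,2}P(?:D{4}|D{2})"

# Assumed semi-numeric patterns: "12th May, 2014" or "May 12, 2014".
SEMI_NUMERIC = r"D{1,2}C?MP?(?:D{4}|D{2})|MD{1,2}C?P?(?:D{4}|D{2})"

DATE_PATTERN = re.compile(f"(?:{NUMERIC})|(?:{SEMI_NUMERIC})")


def find_date_spans(labels):
    """Return (start, end) component-index spans that look like dates.

    `labels` is a list of label names, one per component of a text line.
    Each label becomes exactly one character, so regex offsets map
    directly back to component indices.
    """
    encoded = "".join(LABEL_CODES[label] for label in labels)
    return [(m.start(), m.end()) for m in DATE_PATTERN.finditer(encoded)]


# Labels for a line such as "Dated 12th May, 2014".
line = ["text", "digit", "digit", "contraction", "month", "punct",
        "digit", "digit", "digit", "digit"]
print(find_date_spans(line))
```

Encoding one character per component keeps the regular expressions short and lets a match be traced back to the image regions of the matched components. A real system would also need to resolve ambiguous sequences (for example, a month word followed directly by a run of digits), which this sketch does not attempt.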

Bibliographic details

Source
Information Sciences: An International Journal | 2015 | 16 pages
Authors
Mandal Ranju; Roy Partha Pratim; Pal Umapada; Blumenstein Michael
Original format: PDF
Language: English
Chinese Library Classification: automatic information theory
Keywords
Robot reading; Robot retrieval of document; Date-based indexing; Handwritten date extraction; Date spotting; Multi-lingual documents
