Authors’ names extraction from scanned documents

Abstract

Authors’ names are a critical bibliographic element when searching or browsing academic articles stored in digital libraries. However, extracting such bibliographic data from printed documents requires human intervention; it is therefore not cost-effective, even using various document image-processing techniques such as Optical Character Recognition (OCR). In this paper, we describe an automatic authors’ names extraction method for academic articles scanned with OCR mark-up. The proposed method first extracts authors’ blocks, which include assumed author/delimiter characters based on layout analysis, and then uses a specifically designed Hidden Markov Model (HMM) for labeling the unsegmented character strings in the block as those of either an author or a delimiter. We applied the proposed method to Japanese academic articles. Results of these experiments showed that the proposed method correctly extracted more than 99% of authors’ blocks with manual tuning; the proposed HMM correctly labeled more than 95% of the author name strings.
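The labeling step described in the abstract can be illustrated with a toy two-state HMM decoded by the Viterbi algorithm. This is only a minimal sketch under stated assumptions, not the authors' model. The delimiter set `DELIMS`, the probabilities in `START` and `TRANS`, the `emit` function, and the helper `extract_names` are all hypothetical values and names chosen for illustration. The paper's HMM is specifically designed for OCR output of Japanese articles and is not reproduced here.

```python
import math

# Two hidden states, one per label the abstract mentions.
STATES = ("author", "delimiter")

# Assumed delimiter characters (hypothetical; not taken from the paper).
DELIMS = set(",;、，")

# Hypothetical start and transition probabilities.
START = {"author": 0.9, "delimiter": 0.1}
TRANS = {
    "author": {"author": 0.8, "delimiter": 0.2},
    "delimiter": {"author": 0.6, "delimiter": 0.4},
}


def emit(state, ch):
    """Hypothetical emission probability of character ch in a state."""
    is_delim = ch in DELIMS
    if state == "delimiter":
        return 0.9 if is_delim else 0.1
    return 0.05 if is_delim else 0.95


def viterbi(text):
    """Return the most likely state label for each character of text."""
    if not text:
        return []
    # Log-probabilities avoid underflow on long strings.
    score = {s: math.log(START[s]) + math.log(emit(s, text[0])) for s in STATES}
    back = []
    for ch in text[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(emit(s, ch))
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def extract_names(text):
    """Group consecutive 'author' characters into whitespace-trimmed names."""
    names, current = [], []
    for ch, label in zip(text, viterbi(text)):
        if label == "author":
            current.append(ch)
        else:
            names.append("".join(current).strip())
            current = []
    names.append("".join(current).strip())
    return [name for name in names if name]


print(extract_names("Manabu Ohta, Shun Yamasaki"))
# → ['Manabu Ohta', 'Shun Yamasaki']
```

With emissions this clear-cut, the toy model does no more than a character lookup would. An HMM becomes useful when the same character can belong to either a name or a delimiter, because the transition probabilities then let the surrounding context decide the label.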

Bibliographic details

Source: International Conference on Digital Information Management (ICDIM), 2007, 6 pages
Authors: Manabu Ohta; Shun Yamasaki; Takayuki Yakushi; Atsuhiro Takasu
Original format: PDF
CLC classification: TP18-53

Similar documents

1. Ingram Micro Named Fastest Growing Fujitsu Document Scanning Distributor [J]. International Journal of Micrographics & Optical Technology. 2010, No. 1a2
2. Text, photo, and line extraction in scanned documents [J]. M. Sezer Erkilinc, Mustafa Jaber, Eli Saber. Journal of Electronic Imaging. 2012, No. 3
3. Seed based named entity extraction applied to English resume document [J]. Mukta S. Takalikar, Manali M. Kshirsagar. International Journal of Engineering & Technology. 2018, No. 4
4. Authors’ names extraction from scanned documents [C]. Manabu Ohta, Shun Yamasaki, Takayuki Yakushi. International Conference on Digital Information Management. 2007
5. Leveraging knowledge of document structure and named entities for information extraction [D]. Duncan, Frank Bissett. 2005
6. A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models [O]. Dharitri Misra, Siyuan Chen, George R. Thoma
7. Character Keypoint-Based Homography Estimation in Scanned Documents for Efficient Information Extraction [O]. Kushagra Mahajan, Monika Sharma, Lovekesh Vig. 2019
8. 35mm Aerial Compliance Slide Scanning: Recommendations for the Scanning and Naming of Aerial Compliance 35mm Slides [R]. 2001
