Conference: International Conference on Digital Information Management
Authors’ names extraction from scanned documents

Abstract

Authors’ names are a critical bibliographic element when searching or browsing academic articles stored in digital libraries. However, extracting such bibliographic data from printed documents requires human intervention; it is therefore not cost-effective, even using various document image-processing techniques such as Optical Character Recognition (OCR). In this paper, we describe an automatic authors’ names extraction method for academic articles scanned with OCR mark-up. The proposed method first extracts authors’ blocks, which include assumed author/delimiter characters based on layout analysis, and then uses a specifically designed Hidden Markov Model (HMM) for labeling the unsegmented character strings in the block as those of either an author or a delimiter. We applied the proposed method to Japanese academic articles. Results of these experiments showed that the proposed method correctly extracted more than 99% of authors’ blocks with manual tuning; the proposed HMM correctly labeled more than 95% of the author name strings.
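
The labeling step described above can be illustrated with a minimal sketch. It assumes a generic two-state HMM (AUTHOR, DELIM) decoded with the Viterbi algorithm over the characters of one author block. This is not the paper's specifically designed model: the delimiter set, all probabilities, and the helper names `viterbi` and `extract_names` are hypothetical values chosen for illustration, whereas a real system would presumably estimate them from labeled data.

```python
import math

STATES = ("AUTHOR", "DELIM")

# Hypothetical delimiter characters; the paper does not list its delimiter set.
DELIMS = set(" ,、・*†")

# Hypothetical, hand-picked probabilities (not trained, not from the paper).
START = {"AUTHOR": 0.9, "DELIM": 0.1}
TRANS = {
    "AUTHOR": {"AUTHOR": 0.8, "DELIM": 0.2},
    "DELIM": {"AUTHOR": 0.7, "DELIM": 0.3},
}


def emission(state, ch):
    """Probability that `state` emits character `ch` (toy two-class model)."""
    is_delim = ch in DELIMS
    if state == "DELIM":
        return 0.95 if is_delim else 0.05
    return 0.05 if is_delim else 0.95


def viterbi(chars):
    """Return the most likely state label for each character in `chars`."""
    if not chars:
        return []
    # Log-scores of the best path ending in each state at the first character.
    score = {
        s: math.log(START[s]) + math.log(emission(s, chars[0])) for s in STATES
    }
    back = []  # one back-pointer table per character after the first
    for ch in chars[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(emission(s, ch))
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def extract_names(block):
    """Group consecutive AUTHOR-labeled characters into name strings."""
    names, current = [], []
    for ch, label in zip(block, viterbi(block)):
        if label == "AUTHOR":
            current.append(ch)
        elif current:
            names.append("".join(current))
            current = []
    if current:
        names.append("".join(current))
    return names
```

With these toy parameters, `extract_names("山田太郎, 鈴木花子")` returns `['山田太郎', '鈴木花子']`. Decoding the whole block jointly, rather than splitting on a fixed delimiter list, is what lets an HMM label unsegmented strings; this sketch only separates two character classes and would need richer emissions to handle delimiters that also occur inside names.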
