Source: Journal of Database Management

Implicit Semantics Based Metadata Extraction and Matching of Scholarly Documents

Abstract

The authors propose using formatting templates and implicit formatting semantics for automatic metadata identification and segmentation. The plain text and its corresponding formatting information, including line height, font type, and font size, are recognized in parallel to guide metadata identification. The authors use implicit formatting semantics, such as changes in formatting, formatting templates and their implications, and explicit formatting layouts, together with a predefined database of frequently occurring keywords, to increase extraction accuracy. Unlike OCR-based approaches, the authors use the open-source PDFBox package as the basic preprocessing tool to obtain the plain text and formatting values of the document contents. On top of PDFBox they built their own pipeline program, PAXAT, to implement their metadata extraction approach. They tested 10,177 papers from arXiv, ACM, ACL, and other publicly accessible and institution-subscribed sources. The overall extraction accuracies for title, authors, affiliations, and author-affiliation matching are 0.9798, 0.9425, 0.9298, and 0.9109, respectively.
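The idea of using a change of formatting as a segmentation cue can be illustrated with a minimal sketch. This is not PAXAT's actual algorithm, which the abstract does not spell out. It assumes that a preprocessing step (PDFBox in the paper) has already produced one record per text line with its font name, font size, and line height. The `Line` record, the function names, the largest-font title heuristic, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Line:
    """One extracted text line plus its formatting values (hypothetical record)."""
    text: str
    font_name: str
    font_size: float
    line_height: float


def segment_by_formatting(lines):
    """Group consecutive lines that share font name, font size, and line height.

    A change in any of the three values starts a new block, which mimics
    using "change of formatting" as a boundary between metadata fields.
    """
    def key(ln):
        return (ln.font_name, ln.font_size, ln.line_height)

    return [list(group) for _, group in groupby(lines, key=key)]


def guess_title(lines):
    """Return the text of the block set in the largest font (first block on ties).

    Assumption: the title is the most prominent block on the first page.
    """
    blocks = segment_by_formatting(lines)
    if not blocks:
        return ""
    best = max(blocks, key=lambda block: block[0].font_size)
    return " ".join(ln.text for ln in best)


# Made-up first-page lines; the formatting values are illustrative only.
sample = [
    Line("Implicit Semantics Based Metadata Extraction", "Times-Bold", 18.0, 22.0),
    Line("and Matching of Scholarly Documents", "Times-Bold", 18.0, 22.0),
    Line("First Author, Second Author", "Times-Roman", 12.0, 14.0),
    Line("Example University", "Times-Italic", 10.0, 12.0),
]

title = guess_title(sample)
```

A real pipeline would need more than this single cue. The abstract mentions formatting templates, explicit layouts, and a keyword database as additional signals, which this sketch leaves out.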