Implicit Semantics Based Metadata Extraction and Matching of Scholarly Documents

Jiang Congfeng; Liu Junming; Ou Dongyang; Wang Yumei; Yu Lifeng

Abstract

The authors propose using formatting templates and implicit formatting semantics for automatic metadata identification and segmentation. The plain text and its corresponding formatting information, including line height, font type, and font size, are recognized in parallel to guide metadata identification. To increase extraction accuracy, the authors use implicit formatting semantics, such as changes in formatting, formatting templates and implications, and explicit formatting layouts, as well as a predefined database of frequently occurring keywords. Unlike OCR-based approaches, the authors use the open-source PDFBox package as the basic preprocessing tool to obtain the plain text and formatting values of the document contents. On top of PDFBox they built their own pipeline program, PAXAT, to implement their metadata extraction approach. A total of 10,177 papers from arXiv, ACM, ACL, and other publicly accessible and institution-subscribed sources were tested. The overall extraction accuracies for title, authors, affiliations, and author-affiliation matching are 0.9798, 0.9425, 0.9298, and 0.9109, respectively.
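The idea of using formatting changes as segmentation cues can be illustrated with a minimal Python sketch. This is not PAXAT's algorithm, which the abstract does not specify. It assumes each header line has already been extracted with its font name and size (for example by a PDFBox-based preprocessing step), and the `Line` type, the `AFFILIATION_KEYWORDS` list, the labeling heuristics, and the font values in the example are all hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Line:
    """One extracted text line with its formatting (hypothetical structure)."""
    text: str
    font: str
    size: float


# Hypothetical stand-in for a database of frequently occurring keywords.
AFFILIATION_KEYWORDS = ("univ", "institute", "school", "dept",
                        "department", "laboratory", "college")


def segment(lines):
    """Start a new block whenever the (font, size) pair changes."""
    blocks = []
    for (font, size), group in groupby(lines, key=lambda l: (l.font, l.size)):
        text = " ".join(l.text for l in group)
        blocks.append({"font": font, "size": size, "text": text})
    return blocks


def label(blocks):
    """Assumed heuristic: largest font is the title, keyword hits are
    affiliations, and everything else is treated as authors."""
    title_idx = max(range(len(blocks)), key=lambda i: blocks[i]["size"])
    labels = []
    for i, block in enumerate(blocks):
        lowered = block["text"].lower()
        if i == title_idx:
            labels.append("title")
        elif any(k in lowered for k in AFFILIATION_KEYWORDS):
            labels.append("affiliation")
        else:
            labels.append("authors")
    return labels


# Illustrative input; the font names and sizes are made up.
header = [
    Line("Implicit Semantics Based Metadata Extraction", "Times-Bold", 18.0),
    Line("and Matching of Scholarly Documents", "Times-Bold", 18.0),
    Line("Jiang Congfeng, Liu Junming", "Times-Roman", 12.0),
    Line("Hangzhou Dianzi Univ, Sch Comp Sci & Technol", "Times-Italic", 10.0),
]
blocks = segment(header)
labels = label(blocks)
```

Here the two title lines merge into one block because they share a font and size, and the change to a smaller font marks the boundary with the author line. A real system would need many more cues (line height, layout position, templates) than this sketch uses.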

Bibliographic details

Source
Journal of Database Management | 2018, Issue 2 | pp. 1-22 | 22 pages
Authors
Jiang Congfeng; Liu Junming; Ou Dongyang; Wang Yumei; Yu Lifeng
Author affiliations

Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China;

Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China;

Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China;

Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China;

Hithink RoyalFlush Informat Network Co Ltd, Hangzhou, Zhejiang, Peoples R China;

Original format: PDF
Text language: English
Keywords
Formatting Semantics; Information Retrieval; Metadata Extraction; PDF Document; Template;

Similar documents

1. Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents [J]. Iqra Safder, Saeed-Ul Hassan, Anna Visvizi. Information Processing & Management. 2020, Issue 6
2. Metadata Extraction Approach of PDF Documents Based on Measurement Fusion [J]. Junmin Zhao, Huazhong Liu. Journal of Multimedia. 2013, Issue 6
3. Sample-Based Collection and Adjustment of Rules for Metadata Extraction in Business Documents [J]. Toshiko Matsumoto, Mitsuharu Oba, Takashi Onoyama. Electronics and Communications in Japan. 2012, Issue 6
4. Header Metadata Extraction from Semi-structured Documents Using Template Matching [C]. Zewu Huang, Hai Jin, Pingpeng Yuan. On the Move to Meaningful Internet Systems 2006: OTM 2006 Workshops, pt. 2; Lecture Notes in Computer Science; 4278. 2006
5. Image analysis and metadata extraction for document search [D]. Lu, Xiaonan. 2008
6. Disease causality extraction based on lexical semantics and document-clause frequency from biomedical literature [O]. Dong-gi Lee, Hyunjung Shin. 2017
7. Contextual and Metadata-based Approach for the Semantic Annotation of Heterogeneous Documents [O]. Thiam Mouhamadou, Pernelle Nathalie, Bennacer Nacéra. 2008
