Automatic Document Metadata Extraction Based on Deep Networks

Abstract

Extracting metadata from academic papers is valuable for many applications, such as scholarly search and digital libraries. The task has attracted considerable attention over the past decades, and many extraction methods based on templates or on statistical machine learning (e.g., SVM, CRF) have been proposed. It nevertheless remains challenging because page layouts are varied and complex. To address this challenge, this paper introduces deep learning networks to the task, since deep learning has shown great power in areas such as computer vision (CV) and natural language processing (NLP). First, we employ deep learning networks to model the image information and the text information of paper headers separately, which allows our approach to perform metadata extraction with little information loss. We then formulate metadata extraction from a paper header as two typical tasks from different areas: object detection in CV and sequence labeling in NLP. Finally, the two deep networks produced for these two tasks are combined to give the extraction results. Preliminary experiments show that our approach achieves state-of-the-art performance on several open datasets. In addition, the approach can process both image data and text data and does not require hand-designed classification features.
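The abstract does not say how the two networks are combined. As a purely illustrative sketch, one simple possibility is late fusion: each model scores every header line against a set of metadata labels, and the scores are averaged. The label set, the `fuse` function, the `image_weight` parameter, and all numbers below are assumptions made for this example and are not taken from the paper.

```python
# Hypothetical late-fusion sketch; not the authors' actual combination method.
# Assumed label set for lines of a paper header (illustrative only).
LABELS = ["title", "author", "affiliation", "other"]


def fuse(image_probs, text_probs, image_weight=0.5):
    """Return one label per header line from two per-line score lists.

    image_probs / text_probs: one list of scores per line, ordered as LABELS,
    standing in for the outputs of an image-based and a text-based model.
    image_weight: assumed mixing weight given to the image-based scores.
    """
    fused_labels = []
    for p_img, p_txt in zip(image_probs, text_probs):
        # Weighted average of the two score vectors for this line.
        scores = [
            image_weight * a + (1.0 - image_weight) * b
            for a, b in zip(p_img, p_txt)
        ]
        # Pick the label with the highest fused score.
        best = max(range(len(LABELS)), key=scores.__getitem__)
        fused_labels.append(LABELS[best])
    return fused_labels


# Made-up scores for a three-line header.
image_probs = [[0.7, 0.1, 0.1, 0.1], [0.3, 0.4, 0.2, 0.1], [0.1, 0.2, 0.6, 0.1]]
text_probs = [[0.6, 0.2, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.5, 0.3, 0.1]]

print(fuse(image_probs, text_probs))
```

In this toy input the two models disagree on the third line, and the fused result depends on `image_weight`, which is the kind of trade-off any combination scheme has to settle.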

Bibliographic details

Source
International Conference on Natural Language Processing and Chinese Computing | 2017 | 966p | 13 pages
Authors
Runtao Liu; Liangcai Gao; Dong An; Zhuoren Jiang; Zhi Tang
Original format: PDF
Chinese Library Classification: TP312-53
Keywords
Information extraction; Ensemble modeling; Convolutional Neural Networks; Sequence labeling; Recurrent Neural Networks

Similar literature

1. Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents [J]. Iqra Safder, Saeed-Ul Hassan, Anna Visvizi, Information Processing & Management. 2020, No. 6
2. Automatic Extraction of Roadside Traffic Facilities From Mobile Laser Scanning Point Clouds Based on Deep Belief Network [J]. Fang Lina, Shen Guixi, Luo Haifeng, IEEE Transactions on Intelligent Transportation Systems. 2021, No. 4
3. Automatic Building Extraction from Google Earth Images under Complex Backgrounds Based on Deep Instance Segmentation Network [J]. Wen Qi, Jiang Kaiyu, Wang Wei, Nature reviews Cancer. 2019, No. 2
4. Automatic Document Metadata Extraction Based on Deep Networks [C]. Runtao Liu, Liangcai Gao, Dong An, Natural language understanding and intelligent applications. 2017
5. Data mining revision controlled document history metadata for automatic classification [D]. Maass, Dustin. 2013
6. Automatic Building Extraction from Google Earth Images under Complex Backgrounds Based on Deep Instance Segmentation Network [O]. Qi Wen, Kaiyu Jiang, Wei Wang, 2019
7. Automatic extraction of table metadata from digital documents [O]. Ying Liu, Prasenjit Mitra, C. Lee Giles, 2006
