Metadata Extraction Approach of PDF Documents Based on Measurement Fusion

Junmin Zhao; Huazhong Liu

Abstract

To address the low precision and weak adaptability of existing metadata extraction methods, this paper proposes a metadata extraction approach based on a measurement fusion rule. First, features of the document header are extracted, and three statistical learning methods (HMM, SVM, and CRF) are each trained on the labeled data set to construct corresponding metadata extraction models. Then, the results of the three extraction models are fused by the sum rule to achieve accurate metadata extraction. Finally, the three extraction models are dynamically updated by a method based on time-period statistics to keep the ensemble effective. Experiments are conducted on different datasets, and comparative results for these extraction methods are presented. The experimental results show that the proposed approach improves both the precision and the adaptability of metadata extraction.
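
As a rough illustration of the sum-rule fusion step mentioned in the abstract, the sketch below assumes that each of the three models outputs a posterior probability per metadata label for a single header line. The label names, the probability values, and the function name `fuse_by_sum_rule` are hypothetical and are not taken from the paper; the abstract does not specify how the posteriors are obtained or normalized.

```python
from typing import Dict, List

# One model's output: label -> posterior probability (assumed representation).
Posterior = Dict[str, float]


def fuse_by_sum_rule(posteriors: List[Posterior]) -> str:
    """Return the label with the largest summed posterior across models."""
    if not posteriors:
        raise ValueError("need at least one model output")
    totals: Dict[str, float] = {}
    for model_output in posteriors:
        for label, prob in model_output.items():
            # Sum rule: add each model's posterior for the same label.
            totals[label] = totals.get(label, 0.0) + prob
    return max(totals, key=totals.get)


# Hypothetical posteriors for one header line from the three models.
hmm_out = {"title": 0.50, "author": 0.30, "affiliation": 0.20}
svm_out = {"title": 0.30, "author": 0.60, "affiliation": 0.10}
crf_out = {"title": 0.35, "author": 0.55, "affiliation": 0.10}

fused_label = fuse_by_sum_rule([hmm_out, svm_out, crf_out])
print(fused_label)
```

In this made-up example the HMM alone would pick "title", but the summed posteriors favor "author", which is the kind of disagreement the fusion step is meant to resolve.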

Bibliographic record

Source: Journal of Multimedia, 2013, Issue 6, pp. 732-738 (7 pages)
Authors: Junmin Zhao; Huazhong Liu
Author affiliations:

Henan University of Urban Construction/Institute of Computer Science and Engineering, Pingdingshan, China;

Beijing Mapbar Technology Co. Ltd., Beijing, China;

Original format: PDF
Language: English
Keywords: Metadata Extraction; Statistical Learning; Measurement Fusion; Posterior Probability; Sum Rule
