Journal of Multimedia

Metadata Extraction Approach of PDF Documents Based on Measurement Fusion


Abstract

To address the low precision and weak adaptability of existing metadata extraction methods, this paper proposes a metadata extraction approach based on a measurement fusion rule. First, features of the document header are extracted, and three statistical learning methods (HMM, SVM, and CRF) are each trained on the labeled data set to build corresponding metadata extraction models. Then, the results of the three extraction models are fused by the sum rule to extract document metadata more accurately. Finally, the three extraction models are dynamically updated by a method based on time-period statistics, so that the ensemble remains effective. Experiments are conducted on different datasets, and comparative results for these extraction methods are presented. The experimental results show that the proposed approach improves both the precision and the adaptability of metadata extraction.
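
The sum-rule fusion step can be illustrated with a minimal sketch. The abstract does not say which measurements are fused, so this example assumes that each of the three models outputs a normalized, posterior-like score per metadata label for a header line (for an SVM this would require calibrated scores). The function name `sum_rule_fuse`, the label set, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Label = str
Scores = Dict[Label, float]  # assumed: per-label score from one model, summing to 1


def sum_rule_fuse(model_scores: List[Scores]) -> Label:
    """Return the label with the largest summed score across all models."""
    if not model_scores:
        raise ValueError("need scores from at least one model")
    totals: Dict[Label, float] = {}
    for scores in model_scores:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    return max(totals, key=totals.get)


# Hypothetical scores for a single header line (illustrative values only).
hmm = {"title": 0.50, "author": 0.30, "affiliation": 0.20}
svm = {"title": 0.30, "author": 0.60, "affiliation": 0.10}
crf = {"title": 0.35, "author": 0.55, "affiliation": 0.10}

fused_label = sum_rule_fuse([hmm, svm, crf])
print(fused_label)  # author (summed scores: title 1.15, author 1.45, affiliation 0.40)
```

In this toy case the HMM alone would label the line as a title, but the summed scores favor author, which is the kind of disagreement a fusion rule is meant to resolve.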
