Journal: Program: Automated Library and Information Systems

Extracting bibliographical data for PDF documents with HMM and external resources

Abstract

Purpose - The purpose of this paper is to propose an automatic metadata extraction and retrieval system that extracts bibliographical information from digital academic documents in Portable Document Format (PDF).

Design/methodology/approach - The authors use PDFBox to extract text and font size information, a rule-based method to identify titles, and a hidden Markov model (HMM) to extract the titles and authors. Finally, the extracted titles and authors (possibly incorrect or incomplete) are sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS, and Google Scholar) to retrieve the rest of the metadata.

Findings - Four experiments are conducted to examine the feasibility of the proposed system. The first experiment compares two HMMs: a multi-state model and a one-state model (the proposed model). The result shows that the one-state model performs comparably to the multi-state model but is better suited to handling unknown states in real-world data. The second experiment shows that the proposed model (without the aid of online queries) performs as well as other researchers' models on the Cora paper header dataset. The third experiment examines the performance of the system on a small dataset of 43 real PDF research papers. The result shows that the proposed system (with online queries) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. Finally, the fourth experiment uses a larger dataset of 103 papers to compare the system with Zotero 4.0. The result shows that the system significantly outperforms Zotero 4.0. The feasibility of the proposed model is thus demonstrated.
Research limitations/implications - In academic terms, the system is distinctive in two respects. First, it uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which makes it lightweight and scalable. Second, the system is workable and can be applied to extracting metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers' productivity.

Practical implications - In practical terms, the system can outperform an existing tool, Zotero 4.0. This gives practitioners a good opportunity to develop similar products for real applications, though doing so might require some knowledge of HMM implementation.

Originality/value - The HMM implementation is not novel. What is innovative is that the system combines two HMMs: the main model is adapted from Freitag and McCallum (1999), and the authors add to it the word features of the Nymble HMM (Bikel et al., 1997). The system works without manually tagging datasets before training the model (the authors train only on the Cora dataset and test on real-world PDF papers), which differs significantly from prior work. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.
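
The abstract does not give the model's structure or parameters, so the following is only a minimal sketch of how an HMM can label header tokens as title or author text. It is a generic Viterbi decoder, not the authors' implementation. The state set, the `word_feature` classes (loosely in the spirit of Nymble-style word features), and every probability below are hypothetical values invented for illustration; in a real system they would be estimated from tagged training data such as the Cora header set.

```python
import math

# Hypothetical label set; the paper's actual states are not given in the abstract.
STATES = ["title", "author", "other"]


def word_feature(token: str) -> str:
    """Map a token to a coarse word class (an assumed, simplified feature set)."""
    if any(ch.isdigit() for ch in token):
        return "has_digit"
    if token.isupper():
        return "all_caps"
    if token[:1].isupper():
        return "init_cap"
    return "lower"


# Toy parameters, invented for illustration only (not taken from the paper).
START = {"title": 0.8, "author": 0.1, "other": 0.1}
TRANS = {
    "title": {"title": 0.80, "author": 0.15, "other": 0.05},
    "author": {"title": 0.05, "author": 0.70, "other": 0.25},
    "other": {"title": 0.05, "author": 0.05, "other": 0.90},
}
EMIT = {
    "title": {"init_cap": 0.40, "lower": 0.50, "all_caps": 0.08, "has_digit": 0.02},
    "author": {"init_cap": 0.85, "lower": 0.05, "all_caps": 0.08, "has_digit": 0.02},
    "other": {"init_cap": 0.30, "lower": 0.40, "all_caps": 0.10, "has_digit": 0.20},
}


def viterbi(tokens):
    """Return the most likely state sequence for the tokens under the toy HMM."""
    if not tokens:
        return []
    obs = [word_feature(t) for t in tokens]
    # scores[s] holds the best log-probability of any path ending in state s.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    backptrs = []
    for o in obs[1:]:
        new_scores, ptr = {}, {}
        for s in STATES:
            # Pick the predecessor state that maximizes the path score into s.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
            )
            ptr[s] = prev
        scores = new_scores
        backptrs.append(ptr)
    # Trace back from the best final state to recover the full path.
    path = [max(STATES, key=lambda s: scores[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    header = "Extracting bibliographical data for PDF documents Jane Doe".split()
    for token, label in zip(header, viterbi(header)):
        print(f"{token}\t{label}")
```

Working in log space avoids numeric underflow on long token sequences. Emitting word classes rather than raw words is one way a model trained on a single corpus can still assign probabilities to tokens it has never seen.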