Extracting bibliographical data for PDF documents with HMM and external resources

Wen-Feng Hsiao; Te-Min Chang; Erwin Thomas

Abstract

Purpose - This paper proposes an automatic metadata extraction and retrieval system that extracts bibliographical information from digital academic documents in Portable Document Format (PDF).

Design/methodology/approach - The authors use PDFBox to extract text and font size information, a rule-based method to identify titles, and a Hidden Markov Model (HMM) to extract the titles and authors. Finally, the extracted titles and authors (possibly incorrect or incomplete) are sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS, and Google Scholar) to retrieve the rest of the metadata.

Findings - Four experiments are conducted to examine the feasibility of the proposed system. The first experiment compares two different HMM models: a multi-state model and a one-state model (the proposed model). The result shows that the one-state model performs comparably to the multi-state model but is better suited to handling real-world unknown states. The second experiment shows that the proposed model (without the aid of online query) performs as well as other researchers' models on the Cora paper header dataset. The third experiment examines the performance of the system on a small dataset of 43 real PDF research papers. The result shows that the proposed system (with online query) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. Finally, the fourth experiment uses a larger dataset of 103 papers to compare the system with Zotero 4.0. The result shows that the system significantly outperforms Zotero 4.0. The feasibility of the proposed model is thus justified.
Research limitations/implications - Academically, the system is unique in two respects. First, it uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which means the system is lightweight and scalable. Second, the system is workable and can be applied to extracting metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers' productivity.

Practical implications - The system can outperform the existing tool, Zotero v4.0. This gives practitioners a good opportunity to develop similar products for real applications, though doing so may require some knowledge of HMM implementation.

Originality/value - The HMM implementation is not novel. What is innovative is that it combines two HMM models. The main model is adapted from Freitag and McCallum (1999), and the authors add the word features of the Nymble HMM (Bikel et al., 1997) to it. The system is workable even without manually tagging datasets before training the model (the authors train only on the Cora dataset and test on real-world PDF papers), which differs significantly from what other works have done so far. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.
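
As an illustration of how an HMM can label header tokens as title or author, the following is a minimal Viterbi decoding sketch in Python. It is not the authors' model: the two-state layout, the three coarse word features (cap, lower, comma), and every probability below are hypothetical, hand-set values chosen for the example, whereas the paper trains its model on the Cora header set and uses Nymble-style word features.

```python
import math

STATES = ("title", "author")

# Hypothetical, hand-set probabilities for illustration only;
# a real system would estimate them from tagged training headers.
START = {"title": 0.9, "author": 0.1}
TRANS = {
    "title": {"title": 0.8, "author": 0.2},
    "author": {"title": 0.1, "author": 0.9},
}
EMIT = {
    "title": {"cap": 0.45, "lower": 0.50, "comma": 0.05},
    "author": {"cap": 0.70, "lower": 0.05, "comma": 0.25},
}


def word_feature(token):
    """Map a token to a coarse word-feature class (assumed classes)."""
    if token == ",":
        return "comma"
    if token[:1].isupper():
        return "cap"
    return "lower"


def viterbi(tokens):
    """Return the most likely state sequence for the tokens (log space)."""
    obs = [word_feature(t) for t in tokens]
    if not obs:
        return []
    # Scores for the first observation.
    score = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for o in obs[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Backtrack from the best final state.
    path = [max(STATES, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


tokens = "Extracting data from documents Wen-Feng Hsiao , Te-Min Chang".split()
for token, label in zip(tokens, viterbi(tokens)):
    print(f"{label:6s} {token}")
```

With these toy parameters the first four tokens are labeled title and the remaining five author. Working in log space avoids numerical underflow on longer headers.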

Bibliographic record

Source
Program: Automated Library and Information Systems | 2014, Issue 3 | 21 pages
Authors
Wen-Feng Hsiao; Te-Min Chang; Erwin Thomas
Original format: PDF
Text language: English
Chinese Library Classification: Cultural theory
Keywords
Bibliographical information; Hidden Markov Model; Information extraction; PDF documents

Similar documents

1. Extracting bibliographical data for PDF documents with HMM and external resources [J]. Wen-Feng Hsiao, Te-Min Chang, Erwin Thomas. Program: Automated Library and Information Systems. 2014, Issue 3

2. TEXUS: A unified framework for extracting and understanding tables in PDF documents [J]. Rastan Roya, Paik Hye-Young, Shepherd John. Information Processing & Management. 2019, Issue 3

3. A New System for Extracting and Detecting Skin Color Regions from PDF Documents [J]. Tarek Abd El-Hafeez. International Journal on Computer Science and Engineering. 2010, Issue 9

4. Rule Based Approach to Extract Metadata from Scientific PDF Documents [C]. Ahmer Maqsood Hashmi, Muhammad Tanvir Afzal, Sabih ur Rehman. International Conference on Innovative Technologies in Intelligent Systems and Industrial Applications. 2020

5. Automatic semantic header generator for PDF documents [D]. Xue, Furong. 2004

6. Large-Scale Data Mining of Rapid Residue Detection Assay Data From HTML and PDF Documents: Improving Data Access and Visualization for Veterinarians [O]. Majid Jaberi-Douraki, Soudabeh Taghian Dinani, Nuwan Indika Millagaha Gedara. 2021

7. Nations Unies. Human Rights Bibliography: United Nations Documents and Publications, 1980-1990. Genève, ONU, 1994. 5 v. Nations Unies. Human Rights on CD-ROM: Bibliographical Database for United Nations Documents and Publications, 1980-1983. Genève, ONU, 1994. CD-ROM. Radio Suisse Romande. La Déclaration universelle des droits de l’homme et la Genève internationale. Genève, Radio Suisse Romande, 1994. CD-ROM. (Les voix de l’histoire) [O]. Brault, Jean-Rémi. 1994
