Automatic Extraction of Titles from General Documents using Machine Learning

Abstract

In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, methods have been proposed mainly for title extraction from research papers. It has not been clear whether automatic title extraction from general documents is possible. As a case study, we consider extraction from Office documents, including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint respectively) and take them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly utilize formatting information such as font size as features in the models. It turns out that the use of formatting information can lead to quite accurate extraction from general documents. In an experiment on intranet data, precision and recall for title extraction from Word are 0.810 and 0.837 respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895 respectively. Other important new findings in this work include that we can train models in one domain and apply them to another domain, and, more surprisingly, that we can even train models in one language and apply them to another language. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.
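
The pipeline described in the abstract (annotate titles, train a model on formatting features, extract with the trained model, then score by precision and recall) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Unit` class, the three formatting features, the choice of a plain perceptron, the rule of picking the highest-scoring unit as the title, and the toy documents.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """One text unit of a document with hypothetical formatting attributes."""
    text: str
    font_size: float        # in points
    bold: bool
    centered: bool
    is_title: bool = False  # annotation label, used for training only


def features(unit, max_font_size):
    # Bias term plus formatting cues; font size is normalized per document.
    return [1.0, unit.font_size / max_font_size,
            float(unit.bold), float(unit.centered)]


def score(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def train_perceptron(documents, epochs=20):
    """Learn weights from annotated documents (each a list of Unit)."""
    weights = [0.0] * 4
    for _ in range(epochs):
        for doc in documents:
            max_size = max(u.font_size for u in doc)
            for u in doc:
                x = features(u, max_size)
                predicted = score(weights, x) > 0
                if predicted != u.is_title:
                    # Standard perceptron update on a misclassified unit.
                    sign = 1.0 if u.is_title else -1.0
                    weights = [w + sign * xi for w, xi in zip(weights, x)]
    return weights


def extract_title(doc, weights):
    """Simplification: return the text of the highest-scoring unit."""
    max_size = max(u.font_size for u in doc)
    best = max(doc, key=lambda u: score(weights, features(u, max_size)))
    return best.text


def precision_recall(predicted, gold):
    """Compare predicted titles with gold titles; None means 'no title'."""
    correct = sum(1 for p, g in zip(predicted, gold)
                  if p is not None and p == g)
    n_pred = sum(1 for p in predicted if p is not None)
    n_gold = sum(1 for g in gold if g is not None)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


# Toy annotated training data (invented for this sketch).
train_docs = [
    [Unit("Quarterly Sales Report", 24, True, True, is_title=True),
     Unit("Prepared by the finance team", 12, False, True),
     Unit("Revenue grew in all regions.", 11, False, False)],
    [Unit("Confidential", 10, True, False),
     Unit("Project Plan", 20, True, True, is_title=True),
     Unit("This document describes the schedule.", 11, False, False)],
]

weights = train_perceptron(train_docs)

# An unseen document: the model sees only formatting, never the words.
new_doc = [
    Unit("Draft", 10, False, False),
    Unit("Team Offsite Agenda", 18, False, True),
    Unit("Monday: arrival and dinner.", 11, False, False),
]
print(extract_title(new_doc, weights))
```

Because the features in this sketch depend only on formatting and not on the words themselves, the same weights could in principle be applied to documents from another domain or in another language, which is the kind of transfer the abstract reports.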

Bibliographic details

Source: ACM/IEEE Joint Conference on Digital Libraries, 2005, 10 pages
Authors: Yunhua Hu; Hang Li; Yunbo Cao; Dmitriy Meyerzon; Qinghua Zheng
Original format: PDF
Chinese Library Classification: G250.76-53
Keywords: Information extraction; Metadata extraction; Machine learning; Search

Similar documents

1. Automatic extraction of titles from general documents using machine learning [J]. Yunhua Hu, Hang Li, Yunbo Cao. Information Processing & Management, 2006, No. 5.
2. Automatic machine learning of keyphrase extraction from short HTML documents written in Hebrew [J]. Yaakov HaCohen-Kerner, Ittay Stern, David Korkus. Cybernetics and Systems, 2007, No. 1.
3. Multi-lingual date field extraction for automatic document retrieval by machine [J]. Mandal Ranju, Roy Partha Pratim, Pal Umapada. Information Sciences: An International Journal, 2015.
4. Automatic extraction of titles from general documents using machine learning [C]. Yunhua Hu, Hang Li, Yunbo Cao. Digital Libraries, 2005. JCDL '05, 2005.
5. Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning [D]. Mysore Gopinath, Abhijith Athreya. 2018.
6. Using machine learning for concept extraction on clinical documents from multiple data sources [O]. Manabu Torii, Kavishwar Wagholikar, Hongfang Liu. 2011.
7. Automatic extraction of titles from general documents using machine learning [O]. Yunhua Hu, Hang Li, Yunbo Cao. 2006.
