Conference: ACM/IEEE Joint Conference on Digital Libraries

Automatic Extraction of Titles from General Documents using Machine Learning



Abstract

In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previous methods have been proposed mainly for title extraction from research papers, and it has not been clear whether automatic title extraction from general documents is possible. As a case study, we consider extraction from Office documents, including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint respectively), take them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly use formatting information, such as font size, as features in the models. It turns out that formatting information can lead to quite accurate extraction from general documents. In an experiment on intranet data, precision and recall for title extraction from Word are 0.810 and 0.837 respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895 respectively. Other important new findings in this work are that we can train models in one domain and apply them to another domain and, more surprisingly, that we can even train models in one language and apply them to another language. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.
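The pipeline described in the abstract (annotate titles, train a model on formatting features, extract with the trained model, then measure precision and recall) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract names only font size as a feature and does not say which learning algorithm is used, so the `Line` record, the bold, centered, and position features, the perceptron learner, and the toy documents are all hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Line:
    """One text unit of a document with its formatting (hypothetical schema)."""
    text: str
    font_size: float
    bold: bool
    centered: bool


def features(doc: Sequence[Line], i: int) -> List[float]:
    """Formatting features for line i; only font size is named in the abstract."""
    line = doc[i]
    max_size = max(l.font_size for l in doc)
    return [
        1.0,                         # bias term
        line.font_size / max_size,   # font size relative to the largest in the document
        float(line.bold),            # assumed feature
        float(line.centered),        # assumed feature
        1.0 / (1 + i),               # assumed feature: earlier lines score higher
    ]


def score(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train(docs: Sequence[Sequence[Line]], labels: Sequence[Sequence[int]],
          epochs: int = 20) -> List[float]:
    """Plain perceptron (an assumed stand-in learner): label 1 = title, 0 = not title."""
    w = [0.0] * 5
    for _ in range(epochs):
        for doc, ys in zip(docs, labels):
            for i, y in enumerate(ys):
                x = features(doc, i)
                pred = 1 if score(w, x) > 0 else 0
                if pred != y:
                    step = 1.0 if y == 1 else -1.0
                    w = [wi + step * xi for wi, xi in zip(w, x)]
    return w


def extract_title(doc: Sequence[Line], w: Sequence[float]) -> str:
    """Return the text of the highest-scoring line as the title."""
    best = max(range(len(doc)), key=lambda i: score(w, features(doc, i)))
    return doc[best].text


def precision_recall(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall of predicted title labels against annotated ones."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy annotated documents (invented for this sketch, not data from the paper).
train_docs = [
    [Line("Quarterly Sales Report", 24, True, True),
     Line("Prepared by the finance team", 12, False, True),
     Line("Revenue grew in all regions.", 11, False, False)],
    [Line("Internal memo", 10, False, False),
     Line("Migration Plan for Mail Servers", 20, True, True),
     Line("This document describes the plan.", 11, False, False)],
]
train_labels = [[1, 0, 0], [0, 1, 0]]
weights = train(train_docs, train_labels)

test_doc = [
    Line("Confidential", 9, False, False),
    Line("Onboarding Guide", 22, True, True),
    Line("Welcome to the team.", 11, False, False),
]
print(extract_title(test_doc, weights))
```

Normalizing font size by the largest size in each document keeps the feature comparable across documents with different base sizes, which is one plausible reason formatting features could transfer across domains and languages as the abstract reports; the sketch does not test that claim.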
