Machine Learning for digital document processing: from layout analysis to metadata extraction

Abstract

In recent years, the spread of computers and the Internet has made a significant number of documents available in digital format. Collecting them in digital repositories raised problems that go beyond simple acquisition issues, and created the need to organize and classify them in order to improve the effectiveness and efficiency of the retrieval procedure. The success of such a process is tightly related to the ability to understand the semantics of the document components and content. Since the obvious solution of manually creating and maintaining an updated index is clearly infeasible, due to the huge amount of data under consideration, there is a strong interest in methods that can provide solutions for automatically acquiring such knowledge. This work presents a framework that intensively exploits intelligent techniques to support different tasks of automatic document processing, from acquisition to indexing and from categorization to storing and retrieval. The prototypical version of the system DOMINUS is presented, whose main characteristic is the use of a Machine Learning Server, a suite of different inductive learning methods and systems, among which the most suitable for each specific document processing phase is chosen and applied. The core system is the incremental first-order logic learner INTHELEX. Thanks to incrementality, it can continuously update and refine the learned theories, dynamically extending its knowledge to handle even completely new classes of documents. Since DOMINUS is general and flexible, it can be embedded as a document management engine into many different Digital Library systems. Experiments in a real-world domain scenario, scientific conference management, confirmed the good performance of the proposed prototype.
