Conference proceedings: Machine learning and data mining in pattern recognition

A General Learning Method for Automatic Title Extraction from HTML Pages

Abstract

This paper addresses the problem of automatically learning title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works pursued similar objectives, but they considered only titles in text format. In this paper we propose a general learning schema that learns textual titles from style information and image-format titles from image properties.

We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques.

Based on these features, learning algorithms such as Decision Trees and Random Forests are applied and achieve good results despite the heterogeneity of our corpus. We also show that combining both methods can yield better performance.
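To make the idea of learning textual titles from style information more concrete, here is a minimal sketch of how style features might be collected for each text node of a page. The abstract does not list the paper's actual features, so the feature names below (`heading_level`, `is_bold`, `in_title_tag`, `n_words`, `position`) and the helper `extract_candidates` are illustrative assumptions, not the authors' method. The sketch uses only Python's standard `html.parser` module and covers textual titles only; image-format titles are not addressed.

```python
from html.parser import HTMLParser

# Hypothetical style cues; the paper's real feature set is not given in the abstract.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BOLD_TAGS = {"b", "strong"}
SKIP_TAGS = {"script", "style"}  # their content is not visible page text


class StyleFeatureExtractor(HTMLParser):
    """Collect one feature dict per non-empty visible text node."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.candidates = []  # feature dicts, in document order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates sloppy HTML).
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or SKIP_TAGS & set(self.open_tags):
            return
        headings = [t for t in self.open_tags if t in HEADING_TAGS]
        self.candidates.append({
            "text": text,
            "position": len(self.candidates),
            "heading_level": int(headings[-1][1]) if headings else 0,
            "is_bold": any(t in BOLD_TAGS for t in self.open_tags),
            "in_title_tag": "title" in self.open_tags,
            "n_words": len(text.split()),
        })


def extract_candidates(html):
    """Return the list of per-text-node feature dicts for an HTML string."""
    parser = StyleFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.candidates


sample = """<html><head><title>Site name</title><style>h1 {color: red}</style></head>
<body><h1>Automatic Title Extraction</h1><p>Some <b>bold</b> body text.</p></body></html>"""
candidates = extract_candidates(sample)
for c in candidates:
    print(c)
```

Each resulting dict could serve as one row of training data, with a label saying whether the text node is the page title. A decision tree or random forest classifier, as named in the abstract, could then be fitted on such rows; that step is left out here because the abstract gives no details about the training setup.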
