Conference proceedings: Rough sets and knowledge technology

Learning to Extract Web News Title in Template Independent Way


Abstract

Many news sites have large collections of news pages generated dynamically and continuously from underlying databases. Automatic extraction of news titles and contents from news pages is therefore an important technique for applications such as news aggregation systems. However, previous work has found that accurately extracting news titles from news pages of various styles is a challenging task. In this paper, we propose a machine learning approach to tackle this problem. Our approach is independent of templates and thus does not suffer from template updates, which usually invalidate the corresponding extractors. Empirical evaluation of our approach over 5,200 news Web pages collected from 13 important on-line news sites shows that our approach significantly improves the accuracy of news title extraction.
