Conference: International Conference on Rough Sets and Knowledge Technology

Learning to Extract Web News Title in Template Independent Way

Abstract

Many news sites have large collections of news pages generated dynamically and continuously from underlying databases. Automatic extraction of news titles and contents from news pages is therefore an important technique for applications such as news aggregation systems. However, previous work has found that accurately extracting news titles from news pages of various styles is a challenging task. In this paper, we propose a machine learning approach to tackle this problem. Our approach is independent of templates and is thus unaffected by template updates, which usually invalidate the corresponding extractors. Empirical evaluation of our approach on 5,200 news Web pages collected from 13 important online news sites shows that it significantly improves the accuracy of news title extraction.
