Conference: International Conference on World Wide Web

News article extraction with template-independent wrapper

Abstract

We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
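The abstract says that correlations between the news title and the news body are exploited, but it does not describe the actual features. As a rough illustration only, one plausible signal of this kind is lexical overlap: a true headline tends to share words with the article text, while navigation or banner text does not. The sketch below is an assumption and not the authors' method; the function names `tokens`, `title_body_overlap`, and `pick_title`, and the sample strings, are hypothetical.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric tokens (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_body_overlap(title, body):
    """Fraction of the title's tokens that also occur in the body (0.0 to 1.0)."""
    title_tokens = tokens(title)
    if not title_tokens:
        return 0.0
    return len(title_tokens & tokens(body)) / len(title_tokens)


def pick_title(candidates, body):
    """Return the candidate string whose tokens overlap the body the most."""
    return max(candidates, key=lambda c: title_body_overlap(c, body))


# Hypothetical text blocks taken from one page: a navigation bar and a headline.
candidates = ["Home | Sports | Weather", "City council approves new bridge"]
body = "The city council voted on Tuesday to approve funding for a new bridge."
best = pick_title(candidates, body)
```

Because a score like this depends only on the text and not on the page layout, it could in principle carry over to sites with unseen templates, which is the property the abstract emphasizes. A learned wrapper would presumably combine many such features rather than rely on one.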