Learning to Extract Content from News Webpages

Abstract

We consider the problem of content extraction from online news Web pages. To explore to what extent the syntactic markup and the visual structure of a Web page facilitate the extraction of its content, we compare two state-of-the-art classifiers as first instantiations of a general framework that allows for proper model comparison. To this end, we introduce the publicly available NEWS600 corpus, a set of 604 real-world news Web pages that have been annotated with 30 semantic labels. An empirical analysis of the two models on this dataset shows that the inclusion of structural information is indeed advantageous.