Conference: 2010 International Conference on Web Information Systems and Mining

Automatic Web News Content Extraction Based on Similar Pages


Abstract

Today most news pages are generated from an underlying structured source, so we argue that template-dependent wrappers suit them better than template-independent wrappers. In this paper, we propose a novel automatic template-dependent Web news content extraction approach based on similar pages. First, we choose two similar pages as training samples and represent them as two HTML DOM trees. Second, we create the maximum matching tree between the two DOM trees using our simple tree matching and backtracking algorithm. Then, by analyzing the characteristics of the nodes in the maximum matching tree, we eliminate the noise nodes to generate an extraction template. Finally, we build a template-dependent wrapper for target news pages whose structures are similar to the samples. Experimental results indicate that our approach is effective and efficient for Web news content extraction: the average harmonic mean of precision and recall reaches 98.3%.
