Published in: IEEE International Workshop on Semantic Computing and Systems

Incremental Web Page Template Detection by Text Segments



Abstract

Template detection is important for many applications. Most template detection methods use content repetition as a hint to detect template blocks, so they require a large number of web pages as input. They therefore usually process web pages in batches, and a newly crawled page cannot be processed until enough pages have been collected. This requires a large amount of storage to cache web pages and causes a long delay in data refreshing. In this paper, we present an incremental framework for template detection in which a page is processed as soon as it has been crawled. Under this framework, no web page needs to be cached. Experiments show that our framework consumes less than 7% of the storage used by traditional methods, and that the data-refresh delay induced by batch processing is completely eliminated.
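The abstract does not give the algorithm itself, so the following is only a minimal sketch of what incremental, cache-free processing could look like, not the authors' method. It assumes each page has already been split into text segments, and that a segment counts as template once it has appeared on at least `min_pages` pages. The class name `IncrementalTemplateDetector`, the `min_pages` threshold, and the SHA-1 fingerprinting are all hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict


class IncrementalTemplateDetector:
    """Hypothetical sketch: keeps per-segment counters, never whole pages."""

    def __init__(self, min_pages=3):
        # Assumed threshold: a segment seen on this many pages is template.
        self.min_pages = min_pages
        # Fingerprint -> number of distinct pages containing the segment.
        self.segment_pages = defaultdict(int)

    @staticmethod
    def _fingerprint(segment):
        # Collapse whitespace and lowercase so trivial variations match.
        normalized = " ".join(segment.split()).lower()
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def process(self, segments):
        """Process one freshly crawled page.

        Returns (content_segments, template_segments). The page itself
        is not stored; only the counters are updated.
        """
        seen_on_this_page = set()
        content, template = [], []
        for segment in segments:
            fp = self._fingerprint(segment)
            # Count each segment at most once per page.
            if fp not in seen_on_this_page:
                seen_on_this_page.add(fp)
                self.segment_pages[fp] += 1
            if self.segment_pages[fp] >= self.min_pages:
                template.append(segment)
            else:
                content.append(segment)
        return content, template
```

In this sketch, each page is classified the moment it arrives, and the only state kept between pages is one counter per distinct segment fingerprint. A consequence of this simple design is that the first few pages of a site are labelled before the counters have warmed up, so early template segments are reported as content; how the paper handles this is not stated in the abstract.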
