Journal: 《计算机工程与应用》 (Computer Engineering and Applications)

Tag Extraction for Web Page Information Based on Structure Consistency and Feature Learning

         

Abstract

Web page information refers to specific content on a Web page, typically the main body, title, release date, and publishing media outlet. Each item sits inside particular tags of the HTML document. Once those tags are identified automatically, the same information can be extracted from every page that shares the template, which is a great help when crawling content from a large number of Web pages. Because pages generated from the same template share a consistent structure, and because Web page information exhibits certain statistical features, this paper proposes an algorithm that automatically extracts the tags of Web page information based on structure consistency and feature learning. The algorithm consists of three steps: page comparison, content identification, and tag extraction. In experiments on 1620 Web pages covering 51 templates, the results show that obtaining Web page information through the extracted tags is both fast and more accurate.
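The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the features or the learning method, so the content-identification step below uses two simple hand-written rules (a date pattern, and "longest varying text is the body") in place of learned features. All names here (`PathTextParser`, `varying_paths`, `label_paths`, the sample pages) are hypothetical and exist only for illustration.

```python
import re
from html.parser import HTMLParser

# Tags that never have a closing tag; they are skipped when tracking the path.
VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}


class PathTextParser(HTMLParser):
    """Collect the text found under each tag path, e.g. 'html/body/div.content/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.texts = {}   # tag path -> concatenated text under that path

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1].split(".")[0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def path_texts(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def varying_paths(page_a, page_b):
    """Step 1, page comparison: keep paths present in both pages whose text differs."""
    a, b = path_texts(page_a), path_texts(page_b)
    return {p: a[p] for p in a if p in b and a[p] != b[p]}


DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")


def label_paths(candidates):
    """Steps 2 and 3: label candidate paths with assumed rules, return label -> tag path."""
    labels = {}
    for path, text in candidates.items():
        if DATE_RE.search(text):
            labels["date"] = path
    rest = {p: t for p, t in candidates.items() if p != labels.get("date")}
    if rest:
        # Assumption: the longest varying text block is the main body.
        labels["body"] = max(rest, key=lambda p: len(rest[p]))
    return labels


# Two made-up pages that share one template.
PAGE_A = ("<html><body><div class='nav'>Home</div><h1>First story</h1>"
          "<span class='date'>2015-03-01</span>"
          "<div class='content'><p>Long body text of the first article goes here.</p></div>"
          "</body></html>")
PAGE_B = ("<html><body><div class='nav'>Home</div><h1>Second story</h1>"
          "<span class='date'>2015-03-02</span>"
          "<div class='content'><p>A different and equally long body for the second article.</p></div>"
          "</body></html>")

labels = label_paths(varying_paths(PAGE_A, PAGE_B))
print(labels)
```

The navigation block is identical on both pages, so the comparison step discards it, and only the title, date, and body paths remain as candidates. Once the tag paths are known for a template, they can be reused on other pages from that template without repeating the comparison, which is where the speed benefit described in the abstract would come from.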
