Block-Based Information Extraction Algorithm for News Web Pages (基于分块的新闻网页信息抽取算法)

Journal: 《计算机应用与软件》 (Computer Applications and Software)

Abstract

To clean noise from web pages more thoroughly and reduce its effect on the accuracy of news content extraction, we propose two cleaning methods: a template-page-based method for identical noise blocks, and a class-attribute-based method for same-class noise blocks and special noise blocks. Building on this cleaning step, and exploiting the characteristic content layout structure of news web pages, we present a news content extraction algorithm based on a start block and an end block. Experimental results show that, compared with existing algorithms, the proposed algorithm achieves higher extraction accuracy, handles body text stored in either a single block or multiple blocks, and effectively solves the extraction problem for short body text.
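
The abstract gives no implementation details, so the following is only a minimal Python sketch of how such a pipeline might be arranged, not the authors' algorithm. Everything named here is an assumption made for illustration: the `Block` type, the three function names, the class values (`nav`, `ad`, `para`, `related`), and in particular the rules for recognising the start and end blocks, which are passed in as hypothetical predicates. The sketch also assumes the end block is a boundary marker that is not itself part of the body text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical page block: its class attribute and its visible text."""
    cls: str
    text: str


def remove_identical_noise(page, template):
    """Drop blocks that also occur, unchanged, in a template page."""
    template_blocks = set(template)  # frozen dataclasses are hashable
    return [b for b in page if b not in template_blocks]


def remove_class_noise(page, noise_classes):
    """Drop blocks whose class attribute is in an assumed noise set."""
    return [b for b in page if b.cls not in noise_classes]


def extract_content(page, is_start, is_end):
    """Join block text from the first start block up to, but excluding,
    the next end block. If no end block follows, take the rest of the page."""
    start = next((i for i, b in enumerate(page) if is_start(b)), None)
    if start is None:
        return ""
    end = next(
        (i for i in range(start + 1, len(page)) if is_end(page[i])),
        len(page),
    )
    return "\n".join(b.text for b in page[start:end])


# Illustrative data only; none of it comes from the paper.
template = [Block("nav", "Home | World | Sports"), Block("footer", "Copyright")]
page = [
    Block("nav", "Home | World | Sports"),
    Block("title", "Headline"),
    Block("ad", "Buy now"),
    Block("para", "First paragraph."),
    Block("para", "Second paragraph."),
    Block("related", "More stories"),
    Block("footer", "Copyright"),
]

cleaned = remove_class_noise(remove_identical_noise(page, template), {"ad"})
text = extract_content(
    cleaned,
    is_start=lambda b: b.cls == "para",    # assumed start-block rule
    is_end=lambda b: b.cls == "related",   # assumed end-block rule
)
```

Because the boundaries are located by predicates rather than by text length, the same call covers body text held in a single block (the start block is immediately followed by the end block) and body text spread over several blocks.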
