Focused Crawler Based on Bayesian Classification


Abstract

With the rapid growth of the web, its information resources have become increasingly large, and search engines play an important role in navigating this huge body of information. Focused crawling, a core component of a search engine, computes the relationship between search results and the search topic; this relationship is called relevance. Conventional focused crawling methods compute only the relevance between web page content and the search topic. The focused crawler proposed in this paper first computes the importance of each link from the link content and the anchor text, then classifies the links with a Bayesian classifier, and finally computes the relevance of each web page with the cosine similarity function. If the relevance value is greater than a threshold, the page is considered relevant to the predefined topic; otherwise it is considered irrelevant. Experimental results show that the proposed focused crawling method achieves high precision.
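
The abstract does not give formulas or parameters, so the following is only a minimal sketch of two of the steps it names: classifying links with a Bayesian classifier, and thresholding page relevance by cosine similarity. Everything specific here is an assumption made for illustration and is not taken from the paper: the whitespace tokenizer, the multinomial naive Bayes variant with Laplace smoothing, the raw term-frequency vectors, the class labels, the names `NaiveBayesLinkClassifier` and `is_relevant`, and the threshold value `0.3`.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class NaiveBayesLinkClassifier:
    """Hypothetical multinomial naive Bayes over link/anchor-text tokens."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training samples
        self.vocab = set()

    def train(self, samples):
        # samples: iterable of (link_or_anchor_text, label) pairs
        for text, label in samples:
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus Laplace-smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def is_relevant(page_text, topic_text, threshold=0.3):
    # Assumption: 0.3 is an arbitrary illustrative threshold.
    sim = cosine_similarity(Counter(tokenize(page_text)),
                            Counter(tokenize(topic_text)))
    return sim > threshold


# Toy training data, invented for this sketch.
classifier = NaiveBayesLinkClassifier()
classifier.train([
    ("python web crawler tutorial", "relevant"),
    ("focused crawler search engine", "relevant"),
    ("cheap shoes sale", "irrelevant"),
    ("celebrity gossip news", "irrelevant"),
])
```

In a crawler built along these lines, `classifier.predict` would be applied to the anchor text and link content of each outgoing link to decide which links to follow, and `is_relevant` would be applied to each fetched page against a description of the target topic. A real system would likely use weighted term vectors and a tuned threshold rather than the toy choices made here.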
