International Journal of Information and Management Sciences

Hybrid Focused Crawling Based Upon VSM Similarity, WordNet Semantics and Hub Score Learning



Abstract

New websites and Web pages are mushrooming in every corner of the world, and gigabytes of information are uploaded, deleted, or modified every moment. Because of this ever-increasing size, no existing search engine can index the complete Web, and hence none can always provide complete and up-to-date information. Users still have to browse the search results sequentially to find the information they want. Results can also be biased when, for some query, an unrelated page is willfully accessed more often than a related one. A focused crawler addresses the growing size of the Web by browsing only the portion of the Web related to a specific domain: it covers the maximum Web space while looking for domain-related content, and provides more recent and more exact information. In this paper we present a focused crawler architecture based upon WordNet semantics, the Vector Space Model (VSM), and hub score learning. Crawling results are reported for a breadth-first crawler, a VSM-based best-first crawler, a Naive Bayes breadth-first crawler, a Naive Bayes best-first crawler, and the proposed crawler based upon WordNet semantics, VSM, and hub score learning. The results show that the proposed crawler outperforms the others in precision, and that in average time taken to collect 1000 domain-related pages it outperforms all but the Naive Bayes breadth-first crawler, which produces the worst precision among all the competitors.
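The abstract does not give implementation details, but the VSM component it names can be sketched as cosine similarity between term-frequency vectors, blended with a hub score to rank a best-first crawl frontier. Everything below, including the `alpha` weight, the blending formula, and the example URLs, is an illustrative assumption, not the paper's actual method.

```python
import math
import heapq
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors (the VSM measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score_page(page_text: str, topic_text: str, hub_score: float,
               alpha: float = 0.7) -> float:
    """Blend VSM similarity with a hub score; alpha is an illustrative
    weight, not a value taken from the paper."""
    sim = cosine_similarity(Counter(page_text.lower().split()),
                            Counter(topic_text.lower().split()))
    return alpha * sim + (1.0 - alpha) * hub_score

# Best-first frontier: a max-heap (via negated scores) over candidate URLs.
topic = "focused web crawler information retrieval"
frontier = []
for url, text, hub in [("http://example.org/a", "focused crawler design", 0.2),
                       ("http://example.org/b", "cooking recipes", 0.9)]:
    heapq.heappush(frontier, (-score_page(text, topic, hub), url))
best = heapq.heappop(frontier)[1]  # URL with the highest combined score
```

In a real focused crawler the topic vector would come from a seed document set, and the hub score from link-structure learning; here both are stand-in values so the ranking step can run in isolation.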
