International Journal of Information and Management Sciences

Hybrid Focused Crawling Based Upon VSM Similarity, WordNet Semantics and Hub Score Learning



Abstract

New websites and Web pages are mushrooming in every corner of the world, and gigabytes of information are uploaded, deleted, or modified every moment. Because of this ever-increasing size, no existing search engine can index the complete Web, and hence none can always provide complete and up-to-date information. Users still have to browse the search results sequentially to find the information they want. Results can also be biased when, for some query, an unrelated page is willfully accessed more often than a related one. A focused crawler addresses the growing size of the Web by browsing only the portion of the Web related to a specific domain: it covers the maximum Web space while looking for domain-related content, and provides more recent and more exact information. In this paper we present a focused crawler architecture based upon WordNet semantics, the Vector Space Model (VSM), and hub score learning. Crawling results are reported for a breadth-first crawler, a VSM-based best-first crawler, a Naive Bayes breadth-first crawler, a Naive Bayes best-first crawler, and the proposed crawler based upon WordNet semantics, VSM, and hub score learning. The results show that the proposed crawler outperforms the others in precision, and that in average time taken to collect 1000 domain-related pages it outperforms all but the Naive Bayes breadth-first crawler, which produces the worst precision among all the competitors.
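The abstract does not give implementation details, but the VSM component it names can be sketched as cosine similarity between term-frequency vectors, blended with a hub score to rank a best-first crawl frontier. Everything below, including the `alpha` weight, the blending formula, and the example URLs, is an illustrative assumption, not the paper's actual method.

```python
import math
import heapq
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors (the VSM measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score_page(page_text: str, topic_text: str, hub_score: float,
               alpha: float = 0.7) -> float:
    """Blend VSM similarity with a hub score; alpha is an illustrative
    weight, not a value taken from the paper."""
    sim = cosine_similarity(Counter(page_text.lower().split()),
                            Counter(topic_text.lower().split()))
    return alpha * sim + (1.0 - alpha) * hub_score

# Best-first frontier: a max-heap (via negated scores) over candidate URLs.
topic = "focused web crawler information retrieval"
frontier = []
for url, text, hub in [("http://example.org/a", "focused crawler design", 0.2),
                       ("http://example.org/b", "cooking recipes", 0.9)]:
    heapq.heappush(frontier, (-score_page(text, topic, hub), url))
best = heapq.heappop(frontier)[1]  # URL with the highest combined score
```

In a real focused crawler the topic vector would come from a seed document set, and the hub score from link-structure learning; here both are stand-in values so the ranking step can run in isolation.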
