IEEE Asia-Pacific Services Computing Conference
A Web Mining Architectural Model of Distributed Crawler for Internet Searches Using PageRank Algorithm

Abstract

The World Wide Web is growing rapidly, and data today is stored in a distributed manner, creating the need for a search-engine-based architectural model through which people can search the Web. Broad web search engines, as well as many more specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. The crawler is an important module of a web search engine, and its quality directly affects the searching quality of such engines. A web crawler may interact with millions of hosts over a period of weeks or months, so issues of robustness, flexibility, and manageability are of major importance. Given a set of seed URLs, the crawler retrieves the corresponding web pages, parses the HTML files, adds newly discovered URLs to its queue, and returns to the first phase of this cycle. The crawler can also extract other information from the HTML files while parsing them for new URLs. In this paper, we describe the design of a web crawler that uses the PageRank algorithm for distributed searches and can be run on a network of workstations. The crawler scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the web mining architecture of the system and describe efficient techniques for achieving high performance.
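The crawl cycle and ranking step described in the abstract can be sketched as follows. This is a minimal single-process illustration, not the distributed, workstation-network implementation the paper describes; `fetch` and `parse_links` are assumed callables (an HTTP client and an HTML link extractor supplied by the caller), and the damping factor 0.85 is the conventional PageRank default rather than a value taken from the paper.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, parse_links, max_pages=1000):
    """Crawl cycle: retrieve a page, parse its HTML for links,
    enqueue new URLs, and repeat until the queue is empty or a
    page budget is reached. Returns the crawled link graph."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    link_graph = {}                       # url -> list of outgoing links
    while queue and len(link_graph) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:                  # failed fetch: skip, keep crawling
            continue
        links = [urljoin(url, link) for link in parse_links(html)]
        link_graph[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return link_graph

def pagerank(link_graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over the crawled link graph."""
    pages = list(link_graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            out = [q for q in link_graph[p] if q in rank]
            if not out:                   # dangling page: spread rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q in out:
                    new_rank[q] += damping * rank[p] / len(out)
        rank = new_rank
    return rank
```

In a distributed setting of the kind the paper targets, the URL frontier (`queue`/`seen`) would be partitioned across workstations, but the per-page cycle and the ranking computation are the same.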
