IEEE Asia-Pacific Services Computing Conference
A Web Mining Architectural Model of Distributed Crawler for Internet Searches Using PageRank Algorithm

Abstract

The World Wide Web is growing rapidly, and data today is stored in a distributed manner, creating the need for a search-engine-based architectural model through which people can search the Web. Broad web search engines, as well as many more specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. The crawler is an important module of a web search engine, and its quality directly affects the searching quality of such engines. A web crawler may interact with millions of hosts over a period of weeks or months, so issues of robustness, flexibility, and manageability are of major importance. Given a set of seed URLs, the crawler retrieves the corresponding web pages, parses the HTML files, adds newly discovered URLs to its queue, and returns to the first phase of this cycle. The crawler can also extract other information from the HTML files while parsing them for new URLs. In this paper, we describe the design of a web crawler that uses the PageRank algorithm for distributed searches and can be run on a network of workstations. The crawler scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the web mining architecture of the system and describe efficient techniques for achieving high performance.
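The crawl cycle and ranking step described in the abstract can be sketched as follows. This is a minimal single-process illustration, not the distributed, workstation-network implementation the paper describes; `fetch` and `parse_links` are assumed callables (an HTTP client and an HTML link extractor supplied by the caller), and the damping factor 0.85 is the conventional PageRank default rather than a value taken from the paper.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, parse_links, max_pages=1000):
    """Crawl cycle: retrieve a page, parse its HTML for links,
    enqueue new URLs, and repeat until the queue is empty or a
    page budget is reached. Returns the crawled link graph."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    link_graph = {}                       # url -> list of outgoing links
    while queue and len(link_graph) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:                  # failed fetch: skip, keep crawling
            continue
        links = [urljoin(url, link) for link in parse_links(html)]
        link_graph[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return link_graph

def pagerank(link_graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over the crawled link graph."""
    pages = list(link_graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            out = [q for q in link_graph[p] if q in rank]
            if not out:                   # dangling page: spread rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q in out:
                    new_rank[q] += damping * rank[p] / len(out)
        rank = new_rank
    return rank
```

In a distributed setting of the kind the paper targets, the URL frontier (`queue`/`seen`) would be partitioned across workstations, but the per-page cycle and the ranking computation are the same.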
