Scalability and Robustness Testing for Open Source Web Crawlers

Abstract

This paper implements the proposed framework for evaluating web crawlers in terms of scalability and robustness on e-commerce websites. Scalability is the ability of a system to handle a continually increasing amount of data without a decrease in performance. Robustness is the ability of a web crawler to handle exceptions that occur while it is crawling. Multiple testing environments were set up on e-commerce websites, and scalability testing and robustness testing were carried out in them. Scalability attributes and the robustness failure rate were used to quantify the two properties. The Friedman test and the Nemenyi test were used to analyze whether the differences among crawlers are significant. The experimental results show that Heritrix, Scrapy, and Nutch have the best overall scalability. In the non-interference test, Scrapy has the best robustness. In the interference test, however, WebMagic, WebCollector, and Gecco have the best robustness, based on the general test and the database test.
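
The abstract names the Friedman test and the Nemenyi test but does not show how they are applied. The sketch below is one common way to compute them when several crawlers are scored across several test environments; it is not taken from the paper. The function names, the score table, and the convention that a higher score is better are all assumptions made for illustration. The Friedman statistic is computed without a tie correction, and the critical value `q_alpha` must be supplied by the caller from a studentized-range table.

```python
from math import sqrt


def average_ranks(scores):
    """Average rank of each crawler over all environments (1 = best).

    scores: list of rows, one row per test environment, one value per
    crawler. Higher is assumed to be better. Ties share the mean rank.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        # Crawler indices ordered from best score to worst score.
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            # Extend j over the run of crawlers tied with position i.
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1
            for idx in order[i:j + 1]:
                totals[idx] += mean_rank
            i = j + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(scores):
    """Friedman chi-square statistic (no tie correction)."""
    n, k = len(scores), len(scores[0])
    avg = average_ranks(scores)
    sum_sq = sum(r * r for r in avg)
    return 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


def nemenyi_critical_difference(k, n, q_alpha):
    """Minimum gap in average rank that counts as significant.

    q_alpha is the critical value for k groups at the chosen
    significance level, looked up by the caller.
    """
    return q_alpha * sqrt(k * (k + 1) / (6 * n))


if __name__ == "__main__":
    # Hypothetical scores: 4 environments (rows) x 3 crawlers (columns).
    table = [
        [0.91, 0.85, 0.60],
        [0.88, 0.80, 0.65],
        [0.95, 0.79, 0.70],
        [0.90, 0.82, 0.55],
    ]
    print("average ranks:", average_ranks(table))
    print("Friedman statistic:", friedman_statistic(table))
    # 2.343 is the commonly tabulated value for k = 3 at alpha = 0.05.
    print("Nemenyi CD:", nemenyi_critical_difference(3, 4, 2.343))
```

In this scheme the Friedman statistic is compared against a chi-square distribution with k - 1 degrees of freedom. If it indicates a difference, two crawlers are taken to differ significantly when their average ranks are further apart than the critical difference.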
