
An adaptive focused Web crawling algorithm based on learning automata



Abstract

Recent years have witnessed the birth and explosive growth of the Web. This exponential growth has turned the Web into a huge source of information in which finding a document without an efficient search engine is unimaginable. Web crawling has become an important aspect of Web search, and the performance of search engines depends strongly on it. Focused Web crawlers try to restrict the crawling process to topic-relevant Web documents. Topic-oriented crawlers are widely used in domain-specific Web search portals and personalized search tools. This paper designs a decentralized, learning automata-based focused Web crawler. Taking advantage of learning automata, the proposed crawler learns the most relevant URLs and the promising paths leading to the target on-topic documents. It can effectively adapt its configuration to Web dynamics. The crawler is expected to have a higher precision rate because it constructs a small Web graph containing only on-topic documents. The convergence of the proposed algorithm is proved based on the martingale theorem. To show the performance of the proposed crawler, extensive simulation experiments are conducted. The results show the superiority of the proposed crawler over several existing methods in terms of precision, recall, and running time. A t-test is used to verify the statistical significance of the precision results of the proposed crawler.
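The abstract does not spell out the update rule the crawler uses, so the following is only a minimal sketch of how a learning automaton could learn which outgoing links of a page are promising. It assumes the standard linear reward-inaction (L_R-I) scheme, which may differ from the paper's actual scheme. The class name `LearningAutomaton`, the function `simulate`, the learning rate, and the relevance probabilities are all hypothetical and chosen for illustration.

```python
import random


class LearningAutomaton:
    """Variable-structure learning automaton with an L_R-I update (assumed scheme)."""

    def __init__(self, actions, learning_rate=0.1, rng=None):
        if not actions:
            raise ValueError("at least one action is required")
        self.actions = list(actions)          # e.g. outgoing URLs of one page
        self.learning_rate = learning_rate    # reward step size a, 0 < a < 1
        self.rng = rng or random.Random()
        n = len(self.actions)
        self.probs = [1.0 / n] * n            # start from a uniform distribution

    def choose(self):
        """Sample an action index according to the current probabilities."""
        return self.rng.choices(range(len(self.actions)), weights=self.probs, k=1)[0]

    def reward(self, chosen):
        """Shift probability mass toward the rewarded action.

        p_i <- p_i + a * (1 - p_i) for the chosen action,
        p_j <- (1 - a) * p_j       for every other action.
        On a penalty nothing changes (the 'inaction' part of L_R-I).
        """
        a = self.learning_rate
        self.probs = [
            p + a * (1.0 - p) if j == chosen else (1.0 - a) * p
            for j, p in enumerate(self.probs)
        ]


def simulate(relevance, steps=2000, learning_rate=0.05, seed=0):
    """Toy environment: following a link is rewarded with its relevance probability.

    `relevance` maps a hypothetical URL to the chance that it leads to an
    on-topic document. Returns the learned probability for each URL.
    """
    rng = random.Random(seed)
    automaton = LearningAutomaton(list(relevance), learning_rate, rng)
    for _ in range(steps):
        i = automaton.choose()
        if rng.random() < relevance[automaton.actions[i]]:
            automaton.reward(i)
    return dict(zip(automaton.actions, automaton.probs))


if __name__ == "__main__":
    learned = simulate({"/sports": 0.1, "/crawling": 0.9, "/news": 0.2})
    for url, p in learned.items():
        print(f"{url}: {p:.3f}")
```

In a decentralized design of the kind the abstract describes, one such automaton could be attached to each visited page, so that the chain of high-probability actions across automata traces a path toward on-topic documents. The sketch leaves out relevance scoring of fetched documents and adaptation to changing pages.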
