International Conference on Computing Communication Control and Automation

Parallel Crawling for Detection and Removal of DUST using DUSTER

Abstract

The Web is a widely used medium for searching information with a Web crawler. A Web crawler fetches pages related to a given keyword, but some of them contain duplicate content. Different URLs with Similar Text are known as DUST. To improve search engine performance, the DUSTER method is used. DUSTER detects and removes duplicate URLs without fetching their contents. A single crawler crawls one URL at a time; parallel crawlers crawl multiple URLs concurrently, and their combined results are given as input to DUSTER. Multiple sequence alignment is used to generate candidate rules, which are then filtered according to their performance on a validation set, and finally the duplicate URLs are removed. Using this method, a large reduction in the number of duplicate URLs is achieved.
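
The paper's implementation is not reproduced here, but a minimal Python sketch of the pipeline the abstract describes might look like the following. The rewrite rules, the crawl stub, and all names (CANDIDATE_RULES, canonicalize, dedup_dust) are illustrative assumptions; DUSTER actually derives its rules via multiple sequence alignment over clusters of known duplicate URLs and filters them on a validation set, steps not shown in this sketch.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, hand-written rewrite rules standing in for the rules that
# DUSTER would learn via multiple sequence alignment and then validate.
# Each rule maps a URL pattern to a canonical replacement.
CANDIDATE_RULES = [
    (re.compile(r"index\.html?$"), ""),        # .../index.html -> .../
    (re.compile(r"[?&]sessionid=[^&]+"), ""),  # drop session tokens
    (re.compile(r"^http://www\."), "http://"), # strip leading "www."
]

def crawl(seed_url: str) -> list[str]:
    """Stand-in for one crawler: returns URLs discovered from a seed.

    A real crawler would fetch pages and extract links; this stub only
    marks where the parallel crawlers plug into the pipeline.
    """
    return [seed_url]  # placeholder

def canonicalize(url: str) -> str:
    """Apply the validated rewrite rules to a URL without fetching it."""
    for pattern, replacement in CANDIDATE_RULES:
        url = pattern.sub(replacement, url)
    return url

def dedup_dust(urls: list[str]) -> list[str]:
    """Keep one representative URL per canonical form (DUST removal)."""
    seen: dict[str, str] = {}
    for url in urls:
        seen.setdefault(canonicalize(url), url)
    return list(seen.values())

if __name__ == "__main__":
    seeds = [
        "http://www.example.com/index.html",
        "http://example.com/?sessionid=abc123",
        "http://example.com/page",
    ]
    # Parallel crawlers: each seed is crawled concurrently and the
    # combined results are handed to the DUST-removal step.
    with ThreadPoolExecutor(max_workers=4) as pool:
        combined = [u for batch in pool.map(crawl, seeds) for u in batch]
    print(dedup_dust(combined))  # first two seeds collapse to one URL
```

The key property the sketch preserves is that deduplication operates purely on URL strings, so no page content has to be fetched to detect DUST.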
