Journal: International Journal of Computational Intelligence Research

Effective Detection of Near Duplicate Web Documents in Web Crawling


Abstract

In recent times, web crawling has gained considerable importance owing to the rapid growth of the World Wide Web. The voluminous web documents swarming the web pose huge challenges to web search engines, making their results less relevant to users. The abundance of duplicate and near-duplicate web documents creates additional overhead for search engines and significantly affects their performance and quality. The web crawling research community has widely recognized the need to detect duplicate and near-duplicate web pages. Providing users with pertinent results for their queries on the first page, without duplicate or redundant results, is a vital requirement. In this paper, we present a novel and efficient approach for detecting near-duplicate web pages in web crawling. Near-duplicate web pages are detected first, and the crawled web pages are then stored in repositories. Keywords are first extracted from the crawled pages, and the similarity score between two pages is calculated from the extracted keywords. Documents are considered near duplicates if their similarity score is lower than a threshold value. This detection reduces the memory required for repositories and improves search engine quality.
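The keyword-and-threshold pipeline described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's actual method: the abstract does not define the similarity score, so the sketch assumes a distance-like score (normalized difference in keyword counts), where lower values mean more alike pages, to match the "lower than a threshold" rule. The function names, the stopword list, and the threshold of 0.2 are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list; the abstract does not say how keywords are chosen.
STOPWORDS = {"the", "a", "an", "of", "in", "for", "and", "to", "is", "are", "at"}


def extract_keywords(text):
    """Return keyword counts for a page: lowercase words minus stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def similarity_score(keywords_a, keywords_b):
    """Assumed distance-like score in [0, 1].

    0.0 means identical keyword counts, 1.0 means no keyword in common.
    This is a stand-in; the paper's real formula is not given in the abstract.
    """
    total = sum(keywords_a.values()) + sum(keywords_b.values())
    if total == 0:
        return 0.0  # two pages without keywords are treated as identical
    all_keys = set(keywords_a) | set(keywords_b)
    # Counter returns 0 for missing keys, so unshared keywords count in full.
    diff = sum(abs(keywords_a[k] - keywords_b[k]) for k in all_keys)
    return diff / total


def is_near_duplicate(text_a, text_b, threshold=0.2):
    """Flag two pages as near duplicates when the score is below the threshold."""
    score = similarity_score(extract_keywords(text_a), extract_keywords(text_b))
    return score < threshold


page_a = "Web crawling detects near duplicate web pages"
page_b = "Web crawling detects near duplicate web documents"
page_c = "Recipes for baking sourdough bread at home"

print(is_near_duplicate(page_a, page_b))  # pages differ by one keyword
print(is_near_duplicate(page_a, page_c))  # pages share no keywords
```

In a crawler, a check like this would run before a newly fetched page is written to the repository, so that pages flagged as near duplicates of stored ones can be skipped. A real system would need a tuned threshold and a way to avoid comparing every pair of pages.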
