International Conference on Advanced Data Mining and Applications (ADMA 2010)

Fixing the Threshold for Effective Detection of Near Duplicate Web Documents in Web Crawling

Abstract

The rapid growth of the WWW in recent times has given the concept of web crawling remarkable significance. The voluminous web documents swarming the web pose huge challenges to web search engines, making their results less relevant to users. The abundance of duplicate and near-duplicate web documents creates additional overhead for search engines, critically affecting their performance and quality; these documents must be removed to provide users with relevant results for their queries. In this paper, we present a novel and efficient approach for detecting near-duplicate web pages in web crawling, in which keywords are extracted from the crawled pages and a similarity score between two pages is calculated. Documents whose similarity score exceeds a threshold value are considered near duplicates. In this paper, we fix that threshold value.