International Conference on Advanced Data Mining and Applications (ADMA 2010)

Fixing the Threshold for Effective Detection of Near Duplicate Web Documents in Web Crawling

Abstract

The rapid growth of the WWW in recent times has given the concept of web crawling remarkable significance. The voluminous web documents swarming the web pose huge challenges to web search engines, making their results less relevant to users. The abundance of duplicate and near-duplicate web documents creates additional overhead for search engines, critically affecting their performance and quality; these documents must be removed to provide users with relevant results for their queries. In this paper, we present a novel and efficient approach for detecting near-duplicate web pages in web crawling, in which keywords are extracted from the crawled pages and a similarity score between two pages is calculated. Documents whose similarity score exceeds a threshold value are considered near duplicates. In this paper, we fix that threshold value.