Effective Detection of Near Duplicate Web Documents in Web Crawling

V.A. Narayana; P. Premchand; A. Govardhan

Abstract

Web crawling has gained considerable importance in recent years with the rapid growth of the World Wide Web. The sheer volume of web documents poses major challenges for web search engines and makes their results less relevant to users. The abundance of duplicate and near-duplicate web documents creates additional overhead for search engines and significantly degrades their performance and quality. The web crawling research community has widely recognized the importance of detecting duplicate and near-duplicate web pages. It is essential to give users pertinent results for their queries on the first page, free of duplicate and redundant entries. In this paper, we present a novel and efficient approach for detecting near-duplicate web pages in web crawling. Near-duplicate web pages are detected, and the crawled web pages are then stored in repositories. Keywords are first extracted from the crawled pages, and the similarity score between two pages is calculated from the extracted keywords. Two documents are considered near duplicates if their similarity score is less than a threshold value. This detection reduces the memory required for repositories and improves search engine quality.
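The pipeline described in the abstract (extract keywords, score a pair of pages, compare against a threshold, then store) can be illustrated with a minimal sketch. The abstract does not give the authors' scoring formula, so the score below is an assumed stand-in: a normalized L1 distance between keyword counts, chosen only because it is distance-like and therefore matches the "less than a threshold" rule. The stop-word list, the threshold of 0.2, and all function names are hypothetical, stemming is omitted, and skipping near duplicates at storage time is an assumption drawn from the claim that repository memory is reduced.

```python
import re
from collections import Counter

# Hypothetical list of common words; the paper's actual list is not given here.
COMMON_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def extract_keywords(text):
    """Lowercase and tokenize the text, drop common words, return keyword counts."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in COMMON_WORDS)


def dissimilarity_score(kw_a, kw_b):
    """Assumed score: normalized L1 distance between keyword counts.

    0.0 means identical keyword counts, 1.0 means no keywords in common.
    """
    total = sum(kw_a.values()) + sum(kw_b.values())
    if total == 0:
        return 0.0
    # Counter returns 0 for missing keys, so absent keywords count fully.
    diff = sum(abs(kw_a[k] - kw_b[k]) for k in kw_a.keys() | kw_b.keys())
    return diff / total


def is_near_duplicate(text_a, text_b, threshold=0.2):
    """Pages are near duplicates when the score falls below the threshold."""
    score = dissimilarity_score(extract_keywords(text_a), extract_keywords(text_b))
    return score < threshold


def store_if_new(repository, text, threshold=0.2):
    """Append the page unless it is a near duplicate of a stored page."""
    if any(is_near_duplicate(text, stored, threshold) for stored in repository):
        return False
    repository.append(text)
    return True


page_a = "The crawler stores web pages in the repository"
page_b = "The crawler stores web pages in a repository"
page_c = "Stock prices fell sharply today"

repo = []
for page in (page_a, page_b, page_c):
    store_if_new(repo, page)
```

In this sketch, `page_b` differs from `page_a` only in a common word, so it is skipped, while `page_c` shares no keywords and is stored. Comparing each new page against every stored page is quadratic in the repository size; a real crawler would need an index or fingerprinting scheme, which is outside what the abstract describes.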

Bibliographic details

Source
International Journal of Computational Intelligence Research | 2009, Issue 1 | pp. 83-96 | 14 pages
Authors
V.A. Narayana; P. Premchand; A. Govardhan
Author affiliations

CSE Department, CMR College of Engineering & Technology, JNTU, Hyderabad, India;

Department of Computer Science & Engineering, University College of Engineering, Osmania University Hyderabad-500007, AP, India;

Department of Computer Science and Engineering, JNTU College of Engineering Kukatpally, Hyderabad, India;

Original format: PDF
Language: English
Keywords
web mining; web content mining; web crawling; web pages; stemming; common words; near duplicate pages; near duplicate detection;

Similar literature

1. Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling [J]. International Journal of Electrical and Computer Engineering. 2012, Issue 6.
2. Efficient Web Crawling By Detecting and Shunning Near Duplicate Documents [J]. Prasanna Kumar. Computer Sciences and Telecommunications. 2009, Issue 5.
3. Near-Duplicates Detection and Elimination Based on Web Provenance for Effective Web Search [J]. Y. Syed Mudhasir, J. Deepika, S. Sendhilkumar. International Journal on Internet and Distributed Computing Systems. 2011, Issue 1.
4. To create a confusion matrix in respect of threshold being fixed for effective detection of near duplicate web documents in Web Crawling [C]. Narayana V. A., Govardhan A., Premchand P. 6th International Conference on Computer Sciences and Convergence Information Technology. 2011.
5. Connecting link structure and content on the Web for effective focused crawling [D]. Nickerson, Adam Stuart. 2003.
6. An Efficient Approach for Web Indexing of Big Data through Hyperlinks in Web Crawling [O]. R. Suganya Devi, D. Manjula, R. K. Siddharth. 2015.
7. Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling [O]. Hyderabad Ap, P. Premchand, A. Govardhan. 2013.
