Journal: International Journal on Internet and Distributed Computing Systems

Near-Duplicates Detection and Elimination Based on Web Provenance for Effective Web Search


Abstract

Users of the World Wide Web rely on search engines to retrieve information. However, the performance of a web search is greatly degraded when the results are flooded with redundant information, i.e., near-duplicates. Such near-duplicates keep other promising results from reaching users. Many of these near-duplicates come from untrusted websites and/or authors who host information on the web. Such near-duplicates may be eliminated by means of provenance. This paper therefore proposes a novel approach that identifies near-duplicates based on provenance. In this approach, a provenance model is built from the web pages returned as search results by an existing search engine. The proposed model combines content-based and trust-based factors to classify the results as original or near-duplicate.
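
The abstract does not specify how the content-based and trust-based factors are computed or combined, so the following is only a minimal sketch of the general idea, not the paper's model. It assumes word-shingle Jaccard similarity as the content factor and a precomputed per-page `trust` score in [0, 1] as the trust factor. The function names, the `trust` field, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(results, sim_threshold=0.8):
    """Label each search result as 'original' or 'near-duplicate'.

    results: list of dicts with keys 'url', 'text', and 'trust'
             ('trust' is an assumed, externally supplied score in [0, 1]).
    Pages are visited from most to least trusted, so within a group of
    similar pages the most trusted one is kept as the original.
    """
    ordered = sorted(results, key=lambda r: r["trust"], reverse=True)
    kept = []    # shingle sets of pages already labeled 'original'
    labels = {}
    for r in ordered:
        s = shingles(r["text"])
        if any(jaccard(s, other) >= sim_threshold for other in kept):
            labels[r["url"]] = "near-duplicate"
        else:
            kept.append(s)
            labels[r["url"]] = "original"
    return labels


if __name__ == "__main__":
    pages = [
        {"url": "a.example", "trust": 0.9,
         "text": "search engines play a vital role in finding information"},
        {"url": "b.example", "trust": 0.2,
         "text": "search engines play a vital role in finding information"},
        {"url": "c.example", "trust": 0.5,
         "text": "provenance records who created a page and when"},
    ]
    print(classify(pages))
```

Visiting pages in descending trust order is one simple way to let the trust factor decide which member of a similar group counts as the original. A real provenance model would presumably derive trust from richer metadata such as authorship, source, and publication time.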
