
Crawling and searching the hidden Web.



Abstract

An ever-increasing amount of valuable information on the Web today is hidden behind search interfaces. This information is collectively called the Hidden Web.

In this dissertation, we study how we can effectively collect the data from the Hidden Web and enable users to search for information within the collected data. More specifically, we address some of the main challenges involved in creating a search engine for the Hidden Web:

Crawling the Hidden Web. We study how to build an effective Hidden-Web crawler that can facilitate the collection of information from the Hidden Web. Since there are no links to the Hidden-Web pages, our crawler needs to automatically come up with queries to issue to the Hidden-Web sites. We propose three different query generation policies for the Hidden Web: a policy that picks queries at random from a list of keywords, a policy that picks queries based on their frequency in a generic text collection, and a policy that adaptively picks a good query based on the content of the pages already downloaded from the Hidden-Web site. We compare the effectiveness of our policies by crawling a number of real Hidden-Web sites.

Updating the Hidden-Web pages. The information on the Web today is constantly evolving. Once our crawler has downloaded the information from the Hidden Web, it needs to periodically refresh its local copy so that users can search for up-to-date information. We study the evolution of searchable Web sites using real data collected from the Web over a period of one year. We also propose an efficient sampling-based policy for updating the pages.

Indexing and searching the Hidden Web. Once we have downloaded the Hidden-Web pages, we can enable users to search them for useful information. Search engines typically do this by maintaining large-scale inverted indexes, which are replicated dozens of times for scalability and then pruned in order to reduce the cost of operation.
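The adaptive query generation policy described above can be pictured as a greedy loop: from the pages downloaded so far, estimate which not-yet-issued keyword is likely to match the most new pages on the site, and issue that keyword next. A minimal sketch of this idea follows; the scoring heuristic (document frequency within the downloaded sample) and all names are illustrative assumptions, not the dissertation's exact algorithm.

```python
from collections import Counter

def pick_next_query(downloaded_docs, issued_queries):
    """Greedily pick the next keyword query for a Hidden-Web site.

    Heuristic: a term that appears in many of the pages downloaded so
    far is likely frequent on the site overall, so querying it should
    surface many pages we have not seen yet.  Terms already issued are
    skipped.  Returns None when every observed term has been issued.
    """
    doc_freq = Counter()
    for doc in downloaded_docs:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    for term, _count in doc_freq.most_common():
        if term not in issued_queries:
            return term
    return None
```

In a full crawler this loop would alternate with issuing the chosen query to the site's search form and adding the result pages to `downloaded_docs`, stopping when new queries stop returning unseen pages.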
We show that the pruning approaches currently employed by search engines may significantly degrade the quality of results. To alleviate this problem, we propose modifications to current pruning techniques that avoid any degradation in quality while retaining the benefit of a lower cost of operation.

Fighting Web spam. In the last few years, many sites on the Web have observed an ever-increasing portion of their traffic coming from search engine referrals. Given the large fraction of Web traffic originating from searches and the high potential monetary value of this traffic, some Web site operators try to influence the positioning of their pages within search results by crafting spam Web pages. In the case of the Hidden Web, malicious Web site operators may try to pollute our index for their own benefit by injecting spam content into their Hidden-Web databases so that our crawler downloads it. In this dissertation, we study the prevalence of spam on the Web and present a number of techniques to detect Web spam. We also show how to use machine learning to combine these techniques into a more effective spam detection mechanism.

The techniques proposed in this dissertation have been incorporated into a prototype search engine that currently indexes a few million pages from the Hidden Web.
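The idea of combining individual spam-detection techniques can be sketched with a toy example: extract a few content features from a page and let simple per-feature rules vote. The features, thresholds, and majority-vote combiner here are illustrative assumptions; the dissertation combines such signals with a trained classifier rather than fixed hand-set thresholds.

```python
def spam_features(title, body):
    """Extract simple content features from a page (illustrative set)."""
    words = body.split()
    n = len(words)
    avg_len = sum(len(w) for w in words) / n if n else 0.0
    return {
        "num_words": n,                    # keyword-stuffed pages tend to be very long
        "avg_word_len": avg_len,           # stuffing favors long compound terms
        "title_words": len(title.split()), # stuffed titles are unusually long
    }

def looks_spammy(f):
    """Combine per-feature rules by majority vote.

    A learned combiner (e.g., a decision-tree classifier over these
    features) would replace this vote in a real system.
    """
    votes = (
        (f["num_words"] > 1500)
        + (f["avg_word_len"] > 8)
        + (f["title_words"] > 15)
    )
    return votes >= 2
```

The benefit of combining is that no single rule has to be reliable on its own: a long legitimate article trips one rule but not a majority, while stuffed pages tend to trip several at once.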

Record details

  • Author

    Ntoulas, Alexandros.

  • Author affiliation

    University of California, Los Angeles.

  • Degree grantor: University of California, Los Angeles.
  • Subject: Engineering, System Science; Computer Science.
  • Degree: Ph.D.
  • Year: 2006
  • Pages: 243 p.
  • Total pages: 243
  • Format: PDF
  • Language: eng
  • CLC classification: Systems science; Automation and computer technology
  • Keywords
