Conference: International Conference on Very Large Databases

From Focused Crawling to Expert Information: an Application Framework for Web Exploration and Portal Generation

Abstract

Focused crawling is a relatively new, promising approach to improving the recall of expert search on the Web. It typically starts from a user- or community-specific tree of topics, along with a few training documents for each tree node, and then crawls the Web with a focus on these topics of interest. This process can efficiently build a theme-specific, hierarchical directory whose nodes are populated with relevant, high-quality documents for expert Web search. The BINGO! focused crawler implements an approach that aims to overcome the limitations of the initial training data. To this end, BINGO! identifies, among the crawled and positively classified documents of a topic, characteristic archetypes (good authorities as determined by Kleinberg's HITS algorithm, and documents classified with high confidence using a linear SVM) and uses them to periodically retrain the classifier; in this way the crawler is dynamically adapted based on the most significant documents seen so far.

While a large amount of information can be collected from the "Surface Web" with traditional crawling, as done by today's popular search engines, the major part of high-quality, topic-specific data is stored in searchable databases that only produce results dynamically in response to a direct request (i.e., the "Hidden Web" or "Deep Web"). Automated meta portal generation for these hidden sources comes with all the traditional problems a meta search engine has to face. The demonstration shows our approach towards fully automated portal generation that starts with merely a small set of user-specific training documents and dynamically builds up a unified database of Surface Web data as well as of indexed Deep Web pages, derived from on-the-fly generated Web Service interfaces for form pages and leveraging Semantic-Web-style ontologies.

The prototype platform has been used for generating two applications that illustrate the effectiveness and versatility of our approach: the Handicrafts Information Portal (HIP) built for the Saarland's Chamber of Trades and Small Businesses, and a movie metaportal named MIPS. In the following sections we give a short overview of the BINGO! prototype system and then outline the above-mentioned application demos.
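
The archetype selection described in the abstract can be illustrated with a minimal sketch. It is not the BINGO! implementation: the function names `hits_authorities` and `select_archetypes`, the toy link graph, the confidence values, and the "top-k by authority plus top-k by confidence" rule are all assumptions made for illustration. The confidence scores are taken as given; in practice they might come from a linear SVM, for example as distances from the separating hyperplane.

```python
from math import sqrt


def hits_authorities(links, iterations=50):
    """Approximate HITS authority scores by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking to a page.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                new_auth[t] += hub[src]
        # Hub update: sum the authority scores of all pages a page links to.
        new_hub = {n: sum(new_auth[t] for t in links.get(n, [])) for n in nodes}
        # Normalize both vectors to unit Euclidean length.
        a_norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return auth


def select_archetypes(links, confidence, k=2):
    """Pick archetypes among positively classified documents.

    Assumed rule (hypothetical): the union of the k best authorities
    and the k documents with the highest classifier confidence.
    """
    auth = hits_authorities(links)
    docs = list(confidence)
    by_auth = sorted(docs, key=lambda d: auth.get(d, 0.0), reverse=True)[:k]
    by_conf = sorted(docs, key=lambda d: confidence[d], reverse=True)[:k]
    return set(by_auth) | set(by_conf)


# Toy data (invented): pages "a" and "b" both link to "c", which links to "d".
toy_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
toy_confidence = {"a": 0.9, "b": 0.2, "c": 0.6, "d": 0.4}
archetypes = select_archetypes(toy_links, toy_confidence, k=1)
```

In a retraining loop, the selected archetypes would be added to the training set of the topic before the classifier is rebuilt; how often this happens and how many archetypes are admitted are tuning decisions the abstract does not specify.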
