Twenty-ninth International Conference on Very Large Databases; Sep 9-12, 2003; Berlin, Germany

From Focused Crawling to Expert Information: an Application Framework for Web Exploration and Portal Generation


Abstract

Focused crawling is a relatively new, promising approach to improving the recall of expert search on the Web. It typically starts from a user- or community-specific tree of topics along with a few training documents for each tree node, and then crawls the Web with a focus on these topics of interest. This process can efficiently build a theme-specific, hierarchical directory whose nodes are populated with relevant, high-quality documents for expert Web search. The BINGO! focused crawler implements an approach that aims to overcome the limitations of the initial training data. To this end, BINGO! identifies, among the crawled and positively classified documents of a topic, characteristic archetypes (good authorities as determined by Kleinberg's HITS algorithm, and documents classified with high confidence using a linear SVM) and uses them to retrain the classifier periodically; in this way, the crawler adapts dynamically to the most significant documents seen so far.

While a large amount of information can be collected from the "Surface Web" with traditional crawling as done by today's popular search engines, the major part of high-quality, topic-specific data is stored in searchable databases that only produce results dynamically in response to a direct request (i.e., the "Hidden Web" or "Deep Web"). Automated meta portal generation for these hidden sources comes with all the traditional problems that a meta search engine has to face. The demonstration shows our approach towards fully automated portal generation that starts with only a small set of user-specific training documents and dynamically builds up a unified database of Surface Web data as well as of indexed Deep Web pages derived from Web Service interfaces generated on the fly for form pages, leveraging Semantic-Web-style ontologies.

The prototype platform has been used for generating two applications that illustrate the effectiveness and versatility of our approach: the Handicrafts Information Portal (HIP) built for the Saarland's Chamber of Trades and Small Businesses, and a movie metaportal called MIPS. In the following sections we give a short overview of the BINGO! prototype system and then outline the above-mentioned application demos.
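
The archetype selection described in the abstract can be illustrated with a minimal sketch. It assumes a toy link graph and made-up classifier confidence scores; the function names (`hits`, `select_archetypes`), the parameter `k`, and the data are hypothetical and are not taken from the BINGO! system. The HITS part follows the standard hub/authority iteration, and the confidence scores merely stand in for the output of a trained linear SVM.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations=50):
    """Standard HITS iteration. links maps a page to the pages it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {n: sum(new_auth[d] for d in links.get(n, ())) for n in nodes}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return hub, auth

def select_archetypes(auth, confidence, k=2):
    """Union of the top-k authorities and the top-k most confident documents."""
    top_auth = sorted(auth, key=auth.get, reverse=True)[:k]
    top_conf = sorted(confidence, key=confidence.get, reverse=True)[:k]
    return set(top_auth) | set(top_conf)

if __name__ == "__main__":
    # Hypothetical link graph among positively classified documents of one topic.
    links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
    # Hypothetical classifier confidence per document.
    confidence = {"a": 0.9, "b": 0.1, "c": 0.4, "d": 0.2}
    _, auth = hits(links)
    print(select_archetypes(auth, confidence, k=1))
```

In a crawler following the abstract's description, the selected documents would be added to the topic's training set before the classifier is retrained; that retraining step is omitted here.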
