From Focused Crawling to Expert Information: an Application Framework for Web Exploration and Portal Generation

Abstract

Focused crawling is a relatively new, promising approach to improving the recall of expert search on the Web. It typically starts from a user- or community-specific tree of topics along with a few training documents for each tree node, and then crawls the Web with a focus on these topics of interest. This process can efficiently build a theme-specific, hierarchical directory whose nodes are populated with relevant high-quality documents for expert Web search. The BINGO! focused crawler implements an approach that aims to overcome the limitations of the initial training data. To this end, BINGO! identifies, among the crawled and positively classified documents of a topic, characteristic archetypes (good authorities as determined by Kleinberg's HITS algorithm, and documents classified with high confidence using a linear SVM) and uses them for periodically retraining the classifier; in this way, the crawler is dynamically adapted based on the most significant documents seen so far. While a large amount of information can be collected from the "Surface Web" with traditional crawling as done by today's popular search engines, most high-quality, topic-specific data is stored in searchable databases that only produce results dynamically in response to a direct request (i.e., the "Hidden Web" or "Deep Web"). Automated meta portal generation for these hidden sources comes with all the traditional problems a meta search engine has to face. The demonstration shows our approach to fully automated portal generation, which starts with only a small set of user-specific training documents and dynamically builds a unified database of Surface Web data and of indexed Deep Web pages. The latter are derived from Web Service interfaces that are generated on the fly for form pages, leveraging Semantic-Web-style ontologies.
The prototype platform has been used for generating two applications that illustrate the effectiveness and versatility of our approach: the Handicrafts Information Portal (HIP) built for the Saarland's Chamber of Trades and Small Businesses, and a movie metaportal called MIPS. In the following sections we give a short overview of the BINGO! prototype system and then outline the above-mentioned application demos.
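
The archetype selection mentioned in the abstract can be illustrated with a minimal sketch. It is not the BINGO! implementation: the function names `hits` and `select_archetypes`, the toy link graph, the confidence values, and the rule of taking the union of the top-k authorities and the top-k most confident documents are all assumptions made for illustration. The confidence scores stand in for the signed distance a linear SVM would assign to each document; positive values mean the document was classified into the topic.

```python
from math import sqrt


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph.

    `links` maps each page to the list of pages it links to.
    This is a plain power-iteration version of Kleinberg's HITS.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking to a page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum the authority scores of all pages a page links to.
        hub = {p: sum(auth[d] for d in links.get(p, ())) for p in pages}
        # Normalize both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


def select_archetypes(auth, confidence, k=2):
    """Pick archetypes among positively classified documents.

    Assumed rule (hypothetical): union of the k best authorities and
    the k documents with the highest classifier confidence.
    """
    positive = [d for d, c in confidence.items() if c > 0]
    by_auth = sorted(positive, key=lambda d: auth.get(d, 0.0), reverse=True)[:k]
    by_conf = sorted(positive, key=lambda d: confidence[d], reverse=True)[:k]
    return sorted(set(by_auth) | set(by_conf))


# Toy data (hypothetical): pages a and b link to c, and c links to d.
links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
# Hypothetical SVM confidence per crawled document; d is rejected.
confidence = {"a": 0.2, "b": 1.5, "c": 0.9, "d": -0.3}

hub, auth = hits(links)
archetypes = select_archetypes(auth, confidence, k=1)
print(archetypes)
```

In a retraining step, the documents returned by `select_archetypes` would be added to the topic's training set before the classifier is fitted again; how often this happens and how many archetypes are kept are tuning choices that the abstract does not specify.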

Bibliographic details

Source: International Conference on Very Large Databases, 2003, 4 pages
Authors: Sergej Sizov; Jens Graupmann; Martin Theobald
Original format: PDF
Chinese Library Classification: Automation system theory

Similar literature

1. Extended CurlCrawler: A focused and path-oriented framework for crawling the web with thumb [J]. Dr Ela Kumar, Ashok Kumar. International Journal of Computer Trends and Technology, 2012, No. 3.
2. A New Framework for Focused Web Crawling [J]. PENG Tao, HE Fengling, ZUO Wanli. Wuhan University Journal of Natural Sciences, 2006, No. 5.
3. Application of structured document parsing to focused web crawling [J]. Ahmed Patel, Nikita Schmidt. Computer Standards & Interfaces, 2011, No. 3.
4. From Focused Crawling to Expert Information: an Application Framework for Web Exploration and Portal Generation [C]. Sergej Sizov, Jens Graupmann, Martin Theobald. Twenty-ninth International Conference on Very Large Databases; Sep 9-12, 2003; Berlin, Germany. 2003.
5. JTracer: A framework for automatic test generation for secure Web applications [D]. Herrera Aguirre, Edward Javier. 2010.
6. WIDDE: a Web-Interfaced next generation database for genetic diversity exploration with a first application in cattle [O]. Guilhem Sempéré, Katayoun Moazami-Goudarzi, André Eggen. 2015.
7. Automatic Generation of Thematically Focused Information Portals from Web Data [O]. Sizov Sergej. 2005.
8. Focused Crawling of the Deep Web Using Service Class Descriptions [R]. Rocco, D., Liu, L., Critchlow, T. 2005.
