From Focused Crawling to Expert Information: an Application Framework for Web Exploration and Portal Generation



Abstract

Focused crawling is a relatively new, promising approach to improving the recall of expert search on the Web. It typically starts from a user- or community-specific tree of topics along with a few training documents for each tree node, and then crawls the Web with a focus on these topics of interest. This process can efficiently build a theme-specific, hierarchical directory whose nodes are populated with relevant high-quality documents for expert Web search. The BINGO! focused crawler implements an approach that aims to overcome the limitations of the initial training data. To this end, BINGO! identifies, among the crawled and positively classified documents of a topic, characteristic archetypes (good authorities as determined by Kleinberg's HITS algorithm, and documents classified with high confidence using a linear SVM) and uses them for periodically retraining the classifier; this way, the crawler is dynamically adapted based on the most significant documents seen so far. While a large amount of information can be collected from the "Surface Web" with traditional crawling as done by today's popular search engines, the major part of high-quality, topic-specific data is stored in searchable databases that only produce results dynamically in response to a direct request (i.e., the "Hidden Web" or "Deep Web"). Automated meta portal generation for these hidden sources comes with all the traditional problems a meta search engine has to face. The demonstration shows our approach towards fully automated portal generation that merely starts with a small set of user-specific training documents and dynamically builds up a unified database of Surface Web data as well as of indexed Deep Web pages derived from on-the-fly generated Web Service interfaces for form pages leveraging Semantic-Web-style ontologies.
The prototype platform has been used for generating two applications that illustrate the effectiveness and versatility of our approach: the Handicrafts Information Portal (HIP) built for the Saarland's Chamber of Trades and Small Businesses, and a movie metaportal coined MIPS. In the following sections we give a short overview of the BINGO! prototype system and then outline the above-mentioned application demos.
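
The archetype selection described in the abstract can be illustrated with a minimal sketch. It is not the BINGO! implementation: the function names, the parameter `k`, the toy link graph, and the confidence values are all assumptions made for illustration. Authority scores are computed with a plain power-iteration version of HITS, and the classifier confidences are taken as given inputs (for a linear SVM they would typically be derived from the signed distance to the separating hyperplane). Archetypes are then chosen as the union of the top-`k` authorities and the top-`k` most confidently classified documents among the positively classified ones.

```python
from math import sqrt


def hits_authorities(links, iterations=50):
    """Return HITS authority scores for a link graph.

    links maps each page to the pages it links to (hypothetical toy input).
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of all pages linking to the page.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}

        # Hub update: sum of authority scores of all pages the page links to.
        hub = {n: sum(auth[d] for d in links.get(n, ())) for n in nodes}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return auth


def select_archetypes(auth, confidence, k=2):
    """Pick archetypes among positively classified documents.

    confidence maps each document to an assumed classifier confidence;
    only documents with confidence > 0 count as positively classified.
    The result is the union of the top-k authorities and the top-k
    documents by confidence (k is an illustrative parameter).
    """
    positive = [d for d in confidence if confidence[d] > 0]
    by_auth = sorted(positive, key=lambda d: auth.get(d, 0.0), reverse=True)[:k]
    by_conf = sorted(positive, key=lambda d: confidence[d], reverse=True)[:k]
    return set(by_auth) | set(by_conf)


# Toy data: page "c" is linked from three pages, "d" from two.
links = {"a": ["c"], "b": ["c", "d"], "e": ["c"], "c": ["d"]}
confidence = {"a": 0.2, "b": 0.9, "c": 0.4, "d": 0.1, "e": -0.3}

auth = hits_authorities(links)
archetypes = select_archetypes(auth, confidence, k=1)
print(sorted(archetypes))
```

In this toy run the best authority ("c") and the most confidently classified document ("b") are selected. In a retraining loop of the kind the abstract describes, such documents would be added to the training set of the topic before the classifier is retrained.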


Bibliographic details

Source
Twenty-ninth International Conference on Very Large Databases; Sep 9-12, 2003; Berlin, Germany | 2003 | pp. 1105-1108 | 4 pages
Conference location: Berlin (DE)
Authors
Sergej Sizov; Jens Graupmann; Martin Theobald
Author affiliation

University of the Saarland, Department of Computer Science, P.O. Box 151150, 66041 Saarbruecken, Germany

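Conference organizer
Original format: PDF
Language of text: eng
CLC classification: Automation technology, computer technology
Keywords
Date indexed: 2022-08-26 14:15:36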

Similar literature


1. Extended CurlCrawler: A focused and path-oriented framework for crawling the web with thumb [J]. Dr Ela Kumar, Ashok Kumar. International Journal of Computer Trends and Technology. 2012, No. 3

2. A New Framework for Focused Web Crawling [J]. PENG Tao, HE Fengling, ZUO Wanli. Wuhan University Journal of Natural Sciences. 2006, No. 5

3. Application of structured document parsing to focused web crawling [J]. Ahmed Patel, Nikita Schmidt. Computer Standards & Interfaces. 2011, No. 3

4. From Focused Crawling to Expert Information: an Application Framework for Web Exploration and Portal Generation [C]. Sergej Sizov, Jens Graupmann, Martin Theobald. International Conference on Very Large Databases. 2003

5. JTracer: A framework for automatic test generation for secure Web applications [D]. Herrera Aguirre, Edward Javier. 2010

6. WIDDE: a Web-Interfaced next generation database for genetic diversity exploration with a first application in cattle [O]. Guilhem Sempéré, Katayoun Moazami-Goudarzi, André Eggen. 2015

7. Automatic Generation of Thematically Focused Information Portals from Web Data [O]. Sizov Sergej. 2005

8. Focused Crawling of the Deep Web Using Service Class Descriptions [R]. Rocco, D., Liu, L., Critchlow, T. 2005
