Classifying and searching hidden-web text databases.


Abstract

The World-Wide Web continues to grow rapidly, which makes exploiting all available information a challenge. Search engines such as Google index an unprecedented amount of information, but still do not provide access to valuable content in text databases "hidden" behind search interfaces. For example, current search engines largely ignore the contents of the Library of Congress, the US Patent and Trademark database, newspaper archives, and many other valuable sources of information because their contents are not "crawlable." However, users should be able to find the information that they need with as little effort as possible, regardless of whether this information is crawlable or not. As a significant step towards this goal, we have designed algorithms that support browsing and searching (the two dominant ways of finding information on the web) over "hidden-web" text databases.

To support browsing, we have developed QProber, a system that automatically categorizes hidden-web text databases in a classification scheme, according to their topical focus. QProber categorizes databases without retrieving any document. Instead, QProber uses just the number of matches generated from a small number of topically focused query probes. The query probes are automatically generated using state-of-the-art supervised machine learning techniques and are typically short. QProber's classification approach is sometimes orders of magnitude faster than approaches that require document retrieval.

To support searching, we have developed crucial building blocks for constructing sophisticated metasearchers, which search over many text databases at once through a unified query interface. For scalability and effectiveness, it is crucial for a metasearcher to have a good database selection component and send queries only to databases with relevant content. Usually, database selection algorithms rely on statistics that characterize the contents of each database. Unfortunately, many hidden-web text databases are completely autonomous and do not report any summaries of their contents. To build content summaries for such databases, we extract a small, topically focused document sample from each database during categorization and use it to build the respective content summaries. A potential problem with content summaries derived from document samples is that any reasonably small sample will suffer from data sparseness and will not contain many words that appear in the database. (Abstract shortened by UMI.)
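The probe-counting idea behind QProber can be illustrated with a minimal sketch. The abstract does not give the actual algorithm, so everything concrete here is an assumption made for illustration: the hand-written probe words (the real probes are learned automatically), the flat category list, the `min_share` threshold, and the names `classify_by_probes` and `make_fake_interface`. The only property taken from the abstract is that the classifier sees match counts, never documents.

```python
from typing import Callable, Dict, List

# Hypothetical probes: each category maps to a few short queries.
# In the dissertation these are generated by supervised learning;
# here they are written by hand purely for illustration.
PROBES: Dict[str, List[str]] = {
    "Health": ["cancer", "diabetes treatment"],
    "Sports": ["soccer", "basketball playoffs"],
    "Computers": ["linux", "java compiler"],
}


def classify_by_probes(
    match_count: Callable[[str], int],
    probes: Dict[str, List[str]],
    min_share: float = 0.4,  # assumed threshold, not taken from the source
) -> List[str]:
    """Return the categories whose probes draw at least `min_share`
    of all matches reported by the database's search interface."""
    totals = {
        category: sum(match_count(query) for query in queries)
        for category, queries in probes.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # no probe matched, so no evidence for any category
    return sorted(
        category
        for category, matches in totals.items()
        if matches / grand_total >= min_share
    )


def make_fake_interface(counts: Dict[str, int]) -> Callable[[str], int]:
    """Simulate a search interface that reports only a number of matches."""
    return lambda query: counts.get(query, 0)


# A toy database that answers mostly to the medical probes.
medical_db = make_fake_interface(
    {"cancer": 9200, "diabetes treatment": 3100, "soccer": 40, "linux": 15}
)
print(classify_by_probes(medical_db, PROBES))
```

Because only one count per probe is needed, the cost of classifying a database in this sketch is one query per probe and no document downloads, which is the source of the speed advantage the abstract describes.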


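Bibliographic details

Author: Ipeirotis, Panagiotis G.
Author affiliation: Columbia University
Degree-granting institution: Columbia University
Subject: Computer Science
Degree: Ph.D.
Year: 2004
Pages: 205
Original format: PDF
Language: English
Chinese Library Classification: Automation technology, computer technology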

