
Classifying and searching hidden-web text databases.

Abstract

The World-Wide Web continues to grow rapidly, which makes exploiting all available information a challenge. Search engines such as Google index an unprecedented amount of information, but still do not provide access to valuable content in text databases "hidden" behind search interfaces. For example, current search engines largely ignore the contents of the Library of Congress, the US Patent and Trademark database, newspaper archives, and many other valuable sources of information because their contents are not "crawlable." However, users should be able to find the information that they need with as little effort as possible, regardless of whether this information is crawlable or not. As a significant step towards this goal, we have designed algorithms that support browsing and searching, the two dominant ways of finding information on the web, over "hidden-web" text databases.

To support browsing, we have developed QProber, a system that automatically categorizes hidden-web text databases in a classification scheme, according to their topical focus. QProber categorizes databases without retrieving any document. Instead, QProber uses just the number of matches generated from a small number of topically focused query probes. The query probes are automatically generated using state-of-the-art supervised machine learning techniques and are typically short. QProber's classification approach is sometimes orders of magnitude faster than approaches that require document retrieval.

To support searching, we have developed crucial building blocks for constructing sophisticated metasearchers, which search over many text databases at once through a unified query interface. For scalability and effectiveness, it is crucial for a metasearcher to have a good database selection component and send queries only to databases with relevant content. Usually, database selection algorithms rely on statistics that characterize the contents of each database. Unfortunately, many hidden-web text databases are completely autonomous and do not report any summaries of their contents. To build content summaries for such databases, we extract a small, topically focused document sample from each database during categorization and use it to build the respective content summaries. A potential problem with content summaries derived from document samples is that any reasonably small sample will suffer from data sparseness and will not contain many words that appear in the database. (Abstract shortened by UMI.)
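The probe-counting idea behind QProber can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names, the toy in-memory "database", the hand-written probes, and the two thresholds (`min_coverage`, `min_specificity`) are all assumptions made for illustration. In the abstract the probes are learned automatically by a supervised classifier, and the only signal used is the number of matches each probe returns.

```python
from typing import Callable, Dict, List


def make_counter(docs: List[str]) -> Callable[[str], int]:
    """Hypothetical stand-in for a search interface that reports only match counts."""
    tokenized = [set(doc.lower().split()) for doc in docs]

    def count_matches(query: str) -> int:
        terms = query.lower().split()
        # A document matches when it contains every query term.
        return sum(all(t in words for t in terms) for words in tokenized)

    return count_matches


def classify_by_probes(
    count_matches: Callable[[str], int],
    probes: Dict[str, List[str]],
    min_coverage: int,
    min_specificity: float,
) -> List[str]:
    """Assign categories using match counts only; no document is retrieved."""
    # Total number of matches produced by each category's probes.
    coverage = {
        category: sum(count_matches(q) for q in queries)
        for category, queries in probes.items()
    }
    total = sum(coverage.values())
    if total == 0:
        return []
    # Keep categories with enough matches, both absolute and relative (assumed rule).
    return sorted(
        category
        for category, matches in coverage.items()
        if matches >= min_coverage and matches / total >= min_specificity
    )


# Toy data: four "documents" and two categories with hand-written probes.
docs = [
    "heart disease treatment",
    "cancer therapy trial",
    "heart surgery",
    "soccer match report",
]
probes = {"Health": ["heart", "cancer"], "Sports": ["soccer", "baseball"]}
categories = classify_by_probes(
    make_counter(docs), probes, min_coverage=2, min_specificity=0.4
)
```

In this toy run the Health probes produce three matches and the Sports probes one, so only Health passes both assumed thresholds. A real deployment would replace `make_counter` with calls to the database's search form and read the reported hit count from each result page.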
