
Summarizing and Searching Hidden-Web Databases Hierarchically Using Focused Probes


Abstract

Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from "uncooperative" databases by using "focused query probes," which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. The content summaries that result from this algorithm are efficient to derive and more accurate than those from previously proposed probing techniques for content-summary extraction. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to produce accurate results even for imperfect content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases.
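The abstract does not spell out the probing procedure, so the following is only a minimal sketch of the general idea of focused probing, under stated assumptions. The topic hierarchy, the probe words, the `ToyDatabase` class, and the `min_share` threshold are all hypothetical and chosen for illustration; they are not taken from the paper. The sketch sends probe words for each child category, descends only into categories that attract a large enough share of the matches, and builds a document-frequency summary from the documents retrieved along the way.

```python
from collections import Counter

# Hypothetical topic hierarchy: category -> {child category: probe words}.
HIERARCHY = {
    "Root": {"Health": ["cancer", "diabetes"], "Sports": ["soccer", "tennis"]},
    "Health": {"Oncology": ["tumor", "chemotherapy"]},
}


class ToyDatabase:
    """Stand-in for a hidden-web database: contents are reachable only via search()."""

    def __init__(self, docs):
        self._docs = list(docs)

    def search(self, word, top_k=2):
        """Return (number of matching documents, first top_k matching documents)."""
        matches = [d for d in self._docs if word in d.split()]
        return len(matches), matches[:top_k]


def focused_probe(db, category="Root", min_share=0.5, visited=None, sample=None):
    """Probe db with the probe words of each child of `category`, then recurse
    into children that receive at least `min_share` of the matches (assumed rule)."""
    visited = [] if visited is None else visited
    sample = set() if sample is None else sample
    visited.append(category)

    hits = {}
    for child, probes in HIERARCHY.get(category, {}).items():
        hits[child] = 0
        for word in probes:
            count, docs = db.search(word)
            hits[child] += count
            sample.update(docs)  # retrieved documents feed the content summary

    total = sum(hits.values())
    for child, n in hits.items():
        if total and n / total >= min_share:
            focused_probe(db, child, min_share, visited, sample)
    return visited, sample


def content_summary(sample):
    """Approximate content summary: document frequency of each word in the sample."""
    df = Counter()
    for doc in sample:
        df.update(set(doc.split()))
    return df


db = ToyDatabase([
    "cancer tumor treatment",
    "chemotherapy for cancer patients",
    "diabetes diet advice",
    "soccer match report",
])
visited, sample = focused_probe(db)
summary = content_summary(sample)
```

In this toy run the categories visited double as a rough classification of the database, which is the kind of by-product the abstract says the selection algorithm can exploit alongside the content summaries.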
