Journal: 《模式识别与人工智能》 (Pattern Recognition and Artificial Intelligence)

Incremental Deep Web Crawling under a Top-k Query Constraint (基于Top-k查询约束的深网增量爬取)


Abstract

Third-party applications can rarely crawl a deep web data source completely, because such sources are dynamic, autonomous, and large. This paper addresses deep web crawling under a query-type restriction (only top-k queries are allowed) and limited query resources, and proposes an incremental crawling approach under the top-k query constraint. The approach combines historical data and domain knowledge to maximize the overall data quality of the local repository. First, valid queries are generated from a query tree, and the expected changes and cost of each query are estimated from historical data and domain knowledge. Next, based on the estimated query cost and data quality, an approximately optimal subset of queries is selected to maximize total data quality within the limited query resources. Experiments on real datasets show that the proposed approach improves both the efficiency of crawling dynamic web databases and the quality of the crawled data.
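
The selection step described in the abstract (choosing a subset of queries that maximizes estimated data quality within a limited query budget) can be illustrated with a simple greedy heuristic. The sketch below is only an illustration under stated assumptions, not the paper's algorithm: the abstract does not specify the approximation method, so ranking by estimated gain per unit cost is swapped in here as a common knapsack-style heuristic. The names `CandidateQuery`, `est_cost`, `est_gain`, and `select_queries`, as well as all numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateQuery:
    name: str        # hypothetical label for a valid query from the query tree
    est_cost: int    # assumed estimate: number of top-k requests the query needs
    est_gain: float  # assumed estimate: data-quality gain from re-issuing it


def select_queries(candidates, budget):
    """Greedily pick queries by estimated gain per unit cost within the budget.

    This is a generic heuristic for budgeted selection, used here as a
    stand-in; it is not claimed to be the method proposed in the paper.
    """
    ranked = sorted(
        (c for c in candidates if c.est_cost > 0),
        key=lambda c: c.est_gain / c.est_cost,
        reverse=True,
    )
    chosen, spent = [], 0
    for c in ranked:
        # Skip any query that would exceed the remaining budget.
        if spent + c.est_cost <= budget:
            chosen.append(c)
            spent += c.est_cost
    return chosen, spent


# Made-up estimates, for illustration only.
candidates = [
    CandidateQuery("brand=A", est_cost=4, est_gain=8.0),
    CandidateQuery("brand=B", est_cost=3, est_gain=3.0),
    CandidateQuery("brand=C AND year=2015", est_cost=2, est_gain=5.0),
    CandidateQuery("brand=D", est_cost=5, est_gain=4.0),
]
chosen, spent = select_queries(candidates, budget=7)
print([c.name for c in chosen], spent)
```

With a budget of 7 requests, the heuristic keeps the two queries with the highest estimated gain per request and leaves one request unused. In practice the estimates would come from historical crawl data and domain knowledge, as the abstract describes.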
