Focused Crawling with Heterogeneous Semantic Information

Abstract

Focused crawlers selectively retrieve Web documents that are relevant to a predefined set of topics. To predict which URLs and web pages are relevant and decide which to fetch, various topic models have been introduced to represent topic-specific knowledge. Yet it is difficult to support semantic interoperability among these models. Moreover, manually specified additional semantic information, such as semantic markup and social annotations, cannot be used effectively to improve crawling. This paper proposes to boost focused crawling with four kinds of semantic models and semantic information: thesauruses, categories, ontologies, and folksonomies. A statistical semantic association model is proposed to integrate the different semantic models, represent heterogeneous semantic information, and support semantic relevance computation. A focused crawling framework is developed that uses both keyword-based content and the different kinds of additional information for relevance prediction and ranking. Experiments show that the proposed model and framework effectively integrate heterogeneous semantic information for focused crawling.
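The abstract does not give the details of the association model or the ranking function, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a hypothetical table of term association strengths (`ASSOCIATIONS`, standing in for links drawn from a thesaurus, category system, ontology, or folksonomy), a simple weighted sum with an assumed weight `alpha`, and a priority-queue frontier; all names and numbers are illustrative.

```python
import heapq
import math
from collections import Counter

# Hypothetical association strengths between term pairs. In a real system
# these would be estimated from thesauruses, categories, ontologies, or
# folksonomies; the values here are made up for illustration.
ASSOCIATIONS = {
    ("crawler", "spider"): 0.9,
    ("crawler", "search"): 0.5,
}


def association(a, b):
    """Symmetric lookup of the association strength between two terms."""
    if a == b:
        return 1.0
    return ASSOCIATIONS.get((a, b), ASSOCIATIONS.get((b, a), 0.0))


def keyword_score(topic_terms, text_terms):
    """Cosine similarity between the term-frequency vectors of topic and text."""
    t, d = Counter(topic_terms), Counter(text_terms)
    dot = sum(t[w] * d[w] for w in t)
    norm = math.sqrt(sum(v * v for v in t.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def semantic_score(topic_terms, extra_terms):
    """Mean, over topic terms, of the strongest association with any extra term
    (for example tags or markup attached to a page or link)."""
    if not topic_terms or not extra_terms:
        return 0.0
    best = [max(association(t, e) for e in extra_terms) for t in topic_terms]
    return sum(best) / len(topic_terms)


def relevance(topic_terms, text_terms, extra_terms, alpha=0.5):
    """Assumed combination rule: weighted sum of keyword and semantic scores."""
    return alpha * keyword_score(topic_terms, text_terms) + (
        1 - alpha
    ) * semantic_score(topic_terms, extra_terms)


class Frontier:
    """Crawl frontier that returns the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate for max-heap order

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score
```

In this sketch a crawler would score each extracted link with `relevance(...)`, push it onto the `Frontier`, and fetch whatever `pop()` returns next, so links whose tags or annotations are strongly associated with the topic are visited even when their anchor text shares no keywords with it.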
