Journal: Concurrency and Computation

SOF: a semi-supervised ontology-learning-based focused crawler

Abstract

The rapid growth in the volume of data available on the Internet makes it increasingly impractical for a crawler to index the whole Web. Instead, many intelligent crawlers, known as ontology-based semantic focused crawlers, have been designed that use Semantic Web technologies for topic-centered Web information crawling. Ontologies, however, are constrained in validity and time, which may degrade the performance of these crawlers. Ontology-learning-based focused crawlers are therefore designed to evolve ontologies automatically by integrating ontology learning technologies. Nevertheless, surveys indicate that existing ontology-learning-based focused crawlers cannot automatically enrich the content of ontologies, which makes them unreliable in the open and heterogeneous Web environment. Hence, in this paper, we propose a framework for a novel semi-supervised ontology-learning-based focused (SOF) crawler, which embodies a series of schemas for ontology generation and Web information formatting, a semi-supervised ontology learning framework, and a hybrid Web page classification approach that aggregates a group of support vector machine models. A series of tests is carried out to evaluate the technical feasibility of the proposed framework. Conclusions and future work are summarized in the final section.