Journal: International Journal of Software Engineering and Knowledge Engineering

Clustering-Based Topical Web Crawling for Topic-Specific Information Retrieval Guided by Incremental Classifier

Abstract

The rapid growth of information on the Web makes domain-specific information hard to retrieve, both because the number of data sources is huge and because keywords carry few features. Anchor texts, which contain a few features of a specific topic, play an important role in domain-specific information retrieval, especially in Web page classification. However, the features contained in anchor texts are not informative enough. This paper presents a novel incremental method for Web page classification enhanced by link-contexts and clustering. Applying the anchor-text vector directly to a classifier may give poor results because the vector has so few features. Link-context is therefore used first to obtain the contextual information of the anchor text. Then, a hierarchical clustering method is introduced to cluster feature vectors and content units, which increases the length of a feature vector belonging to one specific class. Finally, an incremental SVM is proposed to obtain the final classifier and to improve its accuracy and efficiency. Experimental results show that the proposed method outperforms the conventional topical Web crawler in harvest rate and target recall.
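
To make two of the abstract's terms concrete, here is a minimal Python sketch under stated assumptions. The abstract does not say how link-context is extracted or how the metrics are computed. The fixed token window in `link_context` is an illustrative assumption, not the authors' method. The `harvest_rate` and `target_recall` functions follow the definitions commonly used for topical crawlers: the share of crawled pages that are relevant, and the share of a held-out target set that the crawl reached. All function and parameter names are hypothetical.

```python
from typing import Iterable, List, Set


def link_context(tokens: List[str], anchor_start: int, anchor_len: int,
                 window: int = 3) -> List[str]:
    """Return the anchor tokens plus up to `window` tokens on each side.

    Assumption: link-context is approximated by a fixed token window;
    the paper may derive it differently (e.g. from page structure).
    """
    lo = max(0, anchor_start - window)
    hi = min(len(tokens), anchor_start + anchor_len + window)
    return tokens[lo:hi]


def harvest_rate(crawled: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of crawled pages that are relevant to the topic."""
    pages = list(crawled)
    if not pages:
        return 0.0
    return sum(1 for url in pages if url in relevant) / len(pages)


def target_recall(crawled: Iterable[str], targets: Set[str]) -> float:
    """Fraction of a held-out set of target pages that the crawl reached."""
    if not targets:
        return 0.0
    return len(set(crawled) & targets) / len(targets)


# Toy data, invented for illustration only.
tokens = "see our guide to incremental SVM training for large web corpora".split()
context = link_context(tokens, anchor_start=4, anchor_len=2, window=2)
# context == ['guide', 'to', 'incremental', 'SVM', 'training', 'for']

crawled = ["a", "b", "c", "d"]
hr = harvest_rate(crawled, relevant={"a", "c", "x"})        # 2 of 4 crawled pages
tr = target_recall(crawled, targets={"a", "x", "y", "z"})   # 1 of 4 targets
```

The window widens the anchor's two tokens to six, which is the kind of feature enrichment the abstract attributes to link-context. The two metrics measure different things: harvest rate looks at crawl precision, target recall at coverage.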

Bibliographic details

  • Authors

    Tao Peng; Lu Liu;

  • Author affiliations

    College of Computer Science and Technology, Jilin University, Changchun 130012, China;
    Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N Goodwin Avenue, Urbana, IL 61801, USA;
    Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Changchun 130012, China;

  • Original format: PDF
  • Language: English
  • Keywords

    Web page classification; link-context; CFu-tree; comparison variation (CV); clustering; incremental SVM;

  • Added to database: 2022-08-18 02:48:19
