Clustering-Based Topical Web Crawling for Topic-Specific Information Retrieval Guided by Incremental Classifier

Tao Peng; Lu Liu

Abstract

The growing volume of information on the Web makes domain-specific information hard to find, both because there are so many data sources and because keyword queries carry few features. Anchor texts, which contain a few features of a specific topic, play an important role in domain-specific information retrieval, especially in Web page classification. However, the features contained in anchor texts alone are not informative enough, so applying the anchor-text vector directly to a classifier may give poor results. This paper presents a novel incremental method for Web page classification enhanced by link-contexts and clustering. First, link-context is used to obtain the contextual information of the anchor text. Then, a hierarchical clustering method is introduced to cluster feature vectors and content units, which increases the length of a feature vector belonging to one specific class. Finally, an incremental SVM is proposed to obtain the final classifier and to improve its accuracy and efficiency. Experimental results show that the proposed method outperforms the conventional topical Web crawler in harvest rate and target recall.
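The abstract names harvest rate and target recall as its two evaluation measures without defining them. The sketch below uses the definitions commonly found in the topical-crawling literature, which is an assumption about what the paper computes: harvest rate is the fraction of crawled pages judged relevant, and target recall is the fraction of a known set of target pages that the crawl reached. The function names, URLs, and relevance labels are hypothetical and chosen only for illustration.

```python
def harvest_rate(crawled, is_relevant):
    """Fraction of crawled pages that the relevance judge accepts."""
    crawled = list(crawled)
    if not crawled:
        return 0.0
    return sum(1 for url in crawled if is_relevant(url)) / len(crawled)


def target_recall(crawled, targets):
    """Fraction of a known set of target pages that the crawl reached."""
    targets = set(targets)
    if not targets:
        return 0.0
    return len(targets & set(crawled)) / len(targets)


# Hypothetical toy crawl: these URLs and labels are made up, not data from the paper.
crawled_pages = ["a.example/1", "a.example/2", "b.example/1", "c.example/9"]
relevant_pages = {"a.example/1", "b.example/1", "d.example/3"}

hr = harvest_rate(crawled_pages, lambda url: url in relevant_pages)
tr = target_recall(crawled_pages, relevant_pages)
print(f"harvest rate = {hr:.2f}, target recall = {tr:.2f}")
```

In practice the relevance judge behind harvest rate is usually a classifier rather than a fixed set, so the two measures need not be driven by the same labels as they are in this toy example.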

Bibliographic details

Source
International Journal of Software Engineering and Knowledge Engineering, 2015, Issue 1, pp. 147-168 (22 pages)
Authors
Tao Peng; Lu Liu
Author affiliations

College of Computer Science and Technology, Jilin University, Changchun 130012, China
Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N Goodwin Avenue, Urbana, IL 61801, USA
Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Changchun 130012, China
Original format: PDF
Language: English
Keywords
Web page classification; link-context; CFu-tree; comparison variation (CV); clustering; incremental SVM

Date added to database: 2022-08-18 02:48:19

Similar literature

1. Clustering-based topical Web crawling using CFu-tree guided by link-context [J]. Lu Liu, Tao Peng. Frontiers of Computer Science in China, 2014, Issue 4.
2. Clustering-Based Incremental Web Crawling [J]. Qingzhao Tan, Prasenjit Mitra. ACM Transactions on Information Systems, 2010, Issue 4.
3. GUIDE: an interactive and incremental approach for crawling Web applications [J]. Journal of Supercomputing, 2020, Issue 3.
4. Adaptive Topical Web Crawling for Domain-Specific Resource Discovery Guided by Link-Context [C]. Tao Peng, Fengling He, Wanli Zuo. Mexican International Conference on Artificial Intelligence (MICAI 2006), 13-17 November 2006, Apizaco, Mexico.
5. Learning to crawl: Classifier-guided topical crawlers [D]. Gautam Pant. 2004.
6. An Efficient Approach for Web Indexing of Big Data through Hyperlinks in Web Crawling [O]. R. Suganya Devi, D. Manjula, R. K. Siddharth. 2015.
7. Clustering-based Incremental Web Crawling [O]. Qingzhao Tan, Prasenjit Mitra. 2010.
