Journal: Expert Systems with Applications

A Web page classification system based on a genetic algorithm using tagged-terms as features


Abstract

The rapid growth of information on the World Wide Web has given rise to topic-specific crawling of the Web. During a focused crawl, an automatic Web page classification mechanism is needed to decide whether the page under consideration is on topic. This study develops a genetic algorithm (GA) based automatic Web page classification system that uses both HTML tags and the terms belonging to each tag as classification features, and that learns an optimal classifier from the positive and negative Web pages in the training dataset. The system classifies a new Web page simply by computing the similarity between the learned classifier and that page. Existing GA-based classifiers use only HTML tags or only terms as features; in this study both are used together, and our GA learns optimal weights for the features. We found that using HTML tags and the terms in each tag as separate features improves classification accuracy. The composition of the training dataset also affects accuracy: when it contains more negative documents than positive ones, the classification accuracy of our system rises to as much as 95% and exceeds that of the well-known Naive Bayes and k-nearest-neighbor classifiers.
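The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the chromosome encoding, the fitness function, or the similarity measure, so everything below is an assumption made for illustration: features are (tag, term) pairs, similarity is cosine similarity, fitness is the mean similarity to positive pages minus the mean similarity to negative pages, and the GA uses elitist selection, uniform crossover, and Gaussian mutation. The names `tagged_terms`, `similarity`, `fitness`, and `evolve`, and the toy pages, are hypothetical.

```python
import math
import random
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed


class TaggedTermParser(HTMLParser):
    """Counts (tag, term) pairs, e.g. ('title', 'football')."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags
        self.features = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack:
            for term in data.lower().split():
                self.features[(self.stack[-1], term)] += 1


def tagged_terms(html):
    parser = TaggedTermParser()
    parser.feed(html)
    return parser.features


def similarity(weights, features):
    """Cosine similarity between classifier weights and page counts (assumed measure)."""
    dot = sum(weights.get(f, 0.0) * c for f, c in features.items())
    norm_w = math.sqrt(sum(w * w for w in weights.values()))
    norm_f = math.sqrt(sum(c * c for c in features.values()))
    return dot / (norm_w * norm_f) if norm_w and norm_f else 0.0


def fitness(weights, positives, negatives):
    """Assumed fitness: be close to positive pages and far from negative ones."""
    pos = sum(similarity(weights, p) for p in positives) / len(positives)
    neg = sum(similarity(weights, n) for n in negatives) / len(negatives)
    return pos - neg


def evolve(positives, negatives, generations=100, pop_size=20, seed=0):
    rng = random.Random(seed)
    vocab = sorted({f for page in positives + negatives for f in page})
    population = [{f: rng.random() for f in vocab} for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda w: fitness(w, positives, negatives), reverse=True)
        parents = population[: pop_size // 2]      # elitist selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover, then Gaussian mutation clamped to [0, 1].
            child = {f: (a[f] if rng.random() < 0.5 else b[f]) for f in vocab}
            for f in vocab:
                if rng.random() < 0.1:
                    child[f] = min(1.0, max(0.0, child[f] + rng.gauss(0, 0.2)))
            children.append(child)
        population = parents + children
    return max(population, key=lambda w: fitness(w, positives, negatives))


# Toy training data (hypothetical pages, not from the paper).
positives = [
    tagged_terms("<title>football scores</title><p>match goal</p>"),
    tagged_terms("<title>football league</title><p>goal keeper</p>"),
]
negatives = [
    tagged_terms("<title>stock market</title><p>shares price</p>"),
    tagged_terms("<title>stock exchange</title><p>price index</p>"),
]
classifier = evolve(positives, negatives)

on_topic_score = similarity(classifier, tagged_terms("<title>football</title><p>goal</p>"))
off_topic_score = similarity(classifier, tagged_terms("<title>stock</title><p>price</p>"))
print(on_topic_score, off_topic_score)
```

Keeping the tag as part of the feature key means the same term can carry a different weight in a title than in a paragraph, which is the distinction the abstract draws against classifiers that use tags or terms alone. A real system would also need a decision threshold on the similarity score, which this sketch leaves out.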
