Conference: WSEAS International Conference on Applied Informatics and Communications

A Novel Efficient Classification Algorithm for Search Engines

Abstract

This paper proposes a new algorithm for classifying Web documents into a set of categories. The technique analyzes the relationships between documents and the terms they contain, producing a set of rules that relate a document's category to its terms and their frequencies. Each document is represented by a graph that correlates its most frequent combined words with its category, and the relationships among these graphs and the documents' categories are captured. The technique has three phases. The first is a training phase, in which human experts determine the categories of different web pages and articles, and the supervised classification algorithm combines these categories with appropriately weighted index terms according to the highest-supported rules among the most frequent words. The second is the blind categorization phase, in which a web crawler crawls the World Wide Web to build a database of URLs and their categories, assigned according to the results of the first phase. The third phase applies the proposed graph representation technique to the whole set of documents in each category to determine that category's final graph representation. The third phase will produce better classification rules because the sample is larger, at no additional cost of supervised categorization. Experiments are conducted on data sets collected from different Web portals.
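
The abstract does not give the rule format, the weighting scheme, or the graph construction, so the following is only a rough sketch of how the three phases could fit together, not the authors' implementation. It assumes a document graph is the set of co-occurrence edges between a document's most frequent words, and that a category graph counts how many documents support each edge. The function names (`top_terms`, `doc_graph`, `category_graphs`, `classify`), the parameter `k`, and the scoring rule are all hypothetical, and a plain list of strings stands in for the crawler's output.

```python
from collections import Counter
from itertools import combinations


def top_terms(text, k=5):
    """Return the k most frequent lowercase words (ties broken alphabetically)."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def doc_graph(text, k=5):
    """Assumed representation: edges between every pair of the top-k words."""
    return set(combinations(sorted(top_terms(text, k)), 2))


def category_graphs(labelled_docs, k=5):
    """Map each category to a Counter of edge -> number of supporting documents."""
    graphs = {}
    for text, category in labelled_docs:
        graphs.setdefault(category, Counter()).update(doc_graph(text, k))
    return graphs


def classify(text, graphs, k=5):
    """Pick the category whose graph gives the document's edges the most support."""
    edges = doc_graph(text, k)

    def score(category):
        graph = graphs[category]
        total = sum(graph.values())
        return sum(graph[e] for e in edges) / total if total else 0.0

    # Sorting the category names makes tie-breaking deterministic.
    return max(sorted(graphs), key=score)


# Phase 1 (training): categories assigned by human experts (toy data).
training = [
    ("python code python function code bug", "programming"),
    ("football goal match football team goal", "sports"),
]
graphs = category_graphs(training)

# Phase 2 (blind categorization): label unlabelled pages with the phase-1 graphs.
crawled = ["python function bug python code", "team match goal football goal"]
auto_labelled = [(text, classify(text, graphs)) for text in crawled]

# Phase 3: rebuild each category graph from the enlarged document set.
final_graphs = category_graphs(training + auto_labelled)
```

In this toy run the phase-3 graphs are built from four documents instead of two, which mirrors the abstract's point that the larger sample costs no extra expert labelling. A real system would also need tokenization, stop-word removal, and the support-based rule selection that the abstract mentions but does not specify.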