Information Processing & Management

Text categorization based on k-nearest neighbor approach for Web site classification

Abstract

Automatic categorization is a viable method to deal with the scaling problem on the World Wide Web. For Web site classification, this paper proposes using the Web pages linked from the home page in addition to the home page itself, whereas previous research used the home page alone. To implement the proposed method, we derive a scheme for Web site classification based on the k-nearest neighbor (k-NN) approach. It consists of three phases: Web page selection (connectivity analysis), Web page classification, and Web site classification. Given a Web site, the Web page selection phase chooses several representative Web pages using connectivity analysis. The k-NN classifier then classifies each of the selected Web pages. Finally, the classifications of the individual Web pages are extended to a classification of the entire Web site. To improve performance, we supplement the k-NN approach with a feature selection method and a term weighting scheme that uses markup tags, and we also revise its document-document similarity measure. In our experiments on a Korean commercial Web directory, the proposed system, using both a home page and its linked pages, improved the micro-averaged breakeven point by 30.02% compared with an ordinary classification that uses the home page only.
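
The last two phases described above (classifying each selected page with k-NN, then extending the page results to the whole site) can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the connectivity analysis, the feature selection, the tag-based term weighting, or the revised similarity measure, so this sketch assumes the representative pages are already selected, uses raw term frequencies with plain cosine similarity, and aggregates by summing per-page category scores. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumption: whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Plain cosine similarity between two term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def knn_scores(page, training, k=3):
    """Score categories by summed similarity over the k nearest training pages."""
    query = tf_vector(page)
    neighbors = sorted(
        ((cosine(query, tf_vector(doc)), label) for doc, label in training),
        reverse=True,
    )[:k]
    scores = Counter()
    for sim, label in neighbors:
        scores[label] += sim
    return scores


def classify_site(pages, training, k=3):
    """Assumed aggregation rule: sum page-level scores, pick the top category."""
    total = Counter()
    for page in pages:
        total.update(knn_scores(page, training, k))
    return total.most_common(1)[0][0] if total else None


# Toy data, invented for illustration only.
training = [
    ("cheap flights hotel booking travel", "travel"),
    ("hotel resort travel tours", "travel"),
    ("stock bank loan finance", "finance"),
    ("bank credit loan interest", "finance"),
]
site_pages = ["travel hotel deals", "book flights and tours", "contact us bank transfer"]

print(classify_site(site_pages, training, k=2))
```

In this toy example a single off-topic page (the third one) does not change the site-level result, which is the intuition behind using linked pages together with the home page rather than one page alone.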
