
A comparative study of centroid-based, neighborhood-based and statistical approaches for effective document categorization


Abstract

Associating documents with relevant categories is critical for effective document retrieval. Here, we compare the well-known k-nearest neighbor (kNN) algorithm, the centroid-based classifier, and the highest average similarity over retrieved documents (HASRD) algorithm for effective document categorization. We use several measures, including micro- and macro-averaged F1, to evaluate their performance on the Reuters-21578 corpus. The empirical results show that, for common document categories, kNN performs best, followed by our adapted HASRD and the centroid-based classifier, while for rare document categories the centroid-based classifier and kNN outperform our adapted HASRD. Additionally, our study indicates that each classifier performs optimally only when a suitable term weighting scheme is used. These results suggest several directions for future exploration.
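The distinction between common and rare categories in these results is tied to the choice of averaging: micro-averaged F1 pools decisions over all categories and is therefore dominated by frequent ones, whereas macro-averaged F1 gives every category equal weight. The following is a minimal sketch of the standard definitions, not the authors' evaluation code. The function names `f1` and `micro_macro_f1` and the toy labels are assumptions made for illustration; no figures here come from the study.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when all three counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1).

    gold, pred: lists of sets of category labels, one set per document.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for c in g & p:      # predicted and correct
            tp[c] += 1
        for c in p - g:      # predicted but wrong
            fp[c] += 1
        for c in g - p:      # missed
            fn[c] += 1
    cats = set(tp) | set(fp) | set(fn)
    # Macro: average the per-category F1 scores.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in cats) / len(cats) if cats else 0.0
    # Micro: pool the counts over all categories, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy data (hypothetical): four documents in a common category, one in a rare one.
gold = [{"common"}, {"common"}, {"common"}, {"common"}, {"rare"}]
# A classifier that always predicts the common category.
pred = [{"common"}] * 5

micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the classifier never recognizes the rare category, so macro-F1 falls well below micro-F1. This is why reporting both measures can separate behavior on common categories from behavior on rare ones.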
