Annual German Conference on AI

Classification of Skewed and Homogenous Document Corpora with Class-Based and Corpus-Based Keywords

Abstract

In this paper, we examine the performance of two keyword selection policies over standard document corpora with varying properties. In the corpus-based policy, a single set of keywords is selected globally for all classes, whereas in the class-based policy a distinct set of keywords is selected locally for each class. We use SVM as the learning method and perform experiments with boolean and tf-idf weighting. Contrary to common belief, we show that using keywords instead of all words generally yields better performance and that tf-idf weighting does not always outperform boolean weighting. Our results reveal that the corpus-based approach performs better with a large number of keywords, while the class-based approach performs better with a small number of keywords. In skewed datasets, class-based keyword selection consistently outperforms the corpus-based approach in terms of macro-averaged F-measure. In homogenous datasets, the performance of the class-based and corpus-based approaches is similar except when the number of keywords is small.
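
As an illustration only, the difference between the two selection policies can be sketched in a few lines of Python. The abstract does not state which scoring metric is used to rank terms, so the document-frequency scores below, the function names, and the toy corpus are all assumptions made for this sketch, not the authors' method.

```python
from collections import Counter, defaultdict

def doc_freq(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def corpus_based_keywords(docs, k):
    """One global keyword set shared by all classes.

    Assumption: terms are ranked by plain document frequency.
    """
    df = doc_freq(docs)
    ranked = sorted(df, key=lambda t: (-df[t], t))  # ties broken alphabetically
    return set(ranked[:k])

def class_based_keywords(docs, labels, k):
    """A separate keyword set for each class.

    Assumption: a term's score for a class is the fraction of in-class
    documents containing it minus the fraction of out-of-class documents
    containing it.
    """
    by_class = defaultdict(list)
    for tokens, label in zip(docs, labels):
        by_class[label].append(tokens)

    keywords = {}
    for label, in_docs in by_class.items():
        out_docs = [d for d, l in zip(docs, labels) if l != label]
        df_in, df_out = doc_freq(in_docs), doc_freq(out_docs)

        def score(term):
            return df_in[term] / len(in_docs) - df_out[term] / max(len(out_docs), 1)

        ranked = sorted(df_in, key=lambda t: (-score(t), t))
        keywords[label] = set(ranked[:k])
    return keywords

# Hypothetical, deliberately skewed toy corpus: three "crude" documents, one "grain".
docs = [
    ["oil", "price", "market"],
    ["oil", "market", "trade"],
    ["market", "trade", "price"],
    ["wheat", "market", "harvest"],
]
labels = ["crude", "crude", "crude", "grain"]

global_keywords = corpus_based_keywords(docs, k=2)
per_class_keywords = class_based_keywords(docs, labels, k=2)
```

In this toy setting the global set is dominated by terms from the majority class, while the per-class sets give the rare class its own keywords. That is one plausible intuition for why class-based selection could help macro-averaged F-measure on skewed data, since macro-averaging weights every class equally.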