Journal: Information Retrieval

Preferential text classification: learning algorithms and evaluation measures


Abstract

In many application contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well-established classification problems, such as single-label and/or multi-label multiclass classification; state-of-the-art learning technology, such as SVMs and kernel-based methods, is used. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form "for document d_i, category c' is preferred to category c''"; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.
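To make the preferential form "for document d_i, category c' is preferred to category c''" concrete, the sketch below expands one document's labels into such preference pairs. It is only an illustration under stated assumptions, not the paper's algorithm: the abstract does not say which pairs are generated, so the ordering used here (primary over secondary, and both over all remaining categories) is an assumption, and the function name `preference_pairs` and the category codes are hypothetical.

```python
from itertools import product


def preference_pairs(primary, secondary, all_categories):
    """Return sorted (preferred, less_preferred) category pairs for one document.

    Assumed ordering (not taken from the paper):
    primary > secondary > every category not assigned to the document.
    """
    primary = set(primary)
    secondary = set(secondary)
    others = set(all_categories) - primary - secondary

    pairs = set()
    pairs.update(product(primary, secondary))  # primary preferred to secondary
    pairs.update(product(primary, others))     # primary preferred to unassigned
    pairs.update(product(secondary, others))   # secondary preferred to unassigned
    return sorted(pairs)


# Hypothetical document with one primary and two secondary categories.
pairs = preference_pairs(
    primary={"A01"},
    secondary={"B02", "C03"},
    all_categories=["A01", "B02", "C03", "D04"],
)
for preferred, other in pairs:
    print(f"{preferred} is preferred to {other}")
```

A preference learner would consume pairs like these as training constraints, so the difference between primary and secondary categories is already visible during learning and not only when labels are assigned.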
