Conference: 6th Iberian Conference on Information Systems and Technologies

Semi-automatic creation of a reference news corpus for fine-grained multi-label scenarios

Abstract

In this paper we tackle the problem of creating a reference corpus for the classification of news items in fine-grained multi-label scenarios. These scenarios are particularly challenging for text classification techniques, and the limited availability of reference corpora is an important bottleneck for developing and testing new classification strategies. We propose a semi-automatic approach for creating a reference corpus that uses three auxiliary classification methods (one based on Support Vector Machines, one based on Nearest Neighbor classifiers, and one based on a dictionary-based classification heuristic) to suggest to human annotators topic-related labels that describe different facets of the news item being annotated. Using this approach, we semi-automatically produce a corpus of 1,600 news items with 865 different labels, with an average of 3.63 labels per news item. We evaluate the contribution of each auxiliary classification method to the annotation process and conclude that: (i) none of the methods alone is capable of suggesting all relevant labels, (ii) the dictionary-based classification heuristic contributes significantly, and (iii) the Nearest Neighbor classifier performs very efficiently in the most extreme multi-label part of the problem and is robust to the very unbalanced item-to-class distribution.
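The abstract does not say how the suggestions from the three methods are combined or how each method's contribution is measured. As a rough illustration only, the sketch below assumes the suggestions are simply pooled as a union, and that contribution is counted as the number of annotator-accepted labels a method proposed, plus the number only that method proposed. The function names, method keys, and toy labels are all hypothetical and do not come from the paper.

```python
from collections import Counter


def pool_suggestions(suggestions_by_method):
    """Union the labels proposed by every method.

    Returns a dict mapping each label to the set of methods that proposed it.
    (Assumption: a plain union; the paper may combine suggestions differently.)
    """
    pooled = {}
    for method, labels in suggestions_by_method.items():
        for label in labels:
            pooled.setdefault(label, set()).add(method)
    return pooled


def method_contribution(pooled, accepted_labels):
    """Count, per method, the accepted labels it suggested and the accepted
    labels that no other method suggested."""
    suggested = Counter()
    exclusive = Counter()
    for label in accepted_labels:
        methods = pooled.get(label, set())
        for method in methods:
            suggested[method] += 1
        if len(methods) == 1:
            exclusive[next(iter(methods))] += 1
    return suggested, exclusive


# Hypothetical suggestions for a single news item.
suggestions = {
    "svm": ["politics", "economy"],
    "knn": ["economy", "elections", "lisbon"],
    "dictionary": ["elections", "parliament"],
}
pooled = pool_suggestions(suggestions)

# Labels the human annotator kept after reviewing the pooled list.
accepted = ["economy", "elections", "parliament", "lisbon"]
suggested, exclusive = method_contribution(pooled, accepted)

for method in suggestions:
    print(method, suggested[method], exclusive[method])
```

In this toy case no single method proposes all four accepted labels, which is the situation conclusion (i) describes; the exclusive counts give one simple way to see what would be lost if a method were dropped.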
