Semi-automatic creation of a reference news corpus for fine-grained multi-label scenarios


Abstract

In this paper we tackle the problem of creating a reference corpus for the classification of news items in fine-grained multi-label scenarios. These scenarios are particularly challenging for text classification techniques, and the limited availability of reference corpora is an important bottleneck for developing and testing new classification strategies. We propose a semi-automatic approach for creating a reference corpus that uses three auxiliary classification methods (one based on Support Vector Machines, one based on Nearest Neighbor classifiers, and one based on a dictionary-based classification heuristic) to suggest to human annotators topic-related labels that describe different facets of the news item being annotated. Using this approach, we semi-automatically produce a corpus of 1,600 news items with 865 different labels and an average of 3.63 labels per news item. We evaluate the contribution of each auxiliary classification method to the annotation process and conclude that: (i) none of the methods alone is capable of suggesting all relevant labels, (ii) the dictionary-based classification heuristic contributes significantly, and (iii) the Nearest Neighbor classifier performs very efficiently in the most extreme multi-label part of the problem and is robust to the very unbalanced item-to-class distribution.
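
The annotation workflow summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the method names (`svm`, `knn`, `dictionary`), the ranking rule (labels proposed by more methods come first), the toy labels, and both helper functions are assumptions made only for illustration. The sketch merges the labels suggested by three methods into one candidate list for an annotator, then counts how many of the accepted labels each method proposed, and how many it alone proposed.

```python
from collections import defaultdict


def merge_suggestions(suggestions_by_method):
    """Union the labels proposed by each method, remembering who proposed what."""
    sources = defaultdict(set)
    for method, labels in suggestions_by_method.items():
        for label in labels:
            sources[label].add(method)
    # Assumed ranking rule: labels proposed by more methods first, ties alphabetical.
    return sorted(sources.items(), key=lambda kv: (-len(kv[1]), kv[0]))


def method_contribution(ranked, accepted):
    """Per method: accepted labels it proposed, and accepted labels only it proposed."""
    stats = defaultdict(lambda: {"proposed": 0, "unique": 0})
    for label, methods in ranked:
        if label not in accepted:
            continue
        for method in methods:
            stats[method]["proposed"] += 1
            if len(methods) == 1:
                stats[method]["unique"] += 1
    return dict(stats)


# Hypothetical suggestions for a single news item (toy data, not from the paper).
suggestions = {
    "svm": ["politics", "economy"],
    "knn": ["politics", "elections", "portugal"],
    "dictionary": ["economy", "budget"],
}
ranked = merge_suggestions(suggestions)

# Labels the human annotator decided to keep (also hypothetical).
accepted = {"politics", "elections", "budget"}
stats = method_contribution(ranked, accepted)

for method, counts in sorted(stats.items()):
    print(method, counts)
```

In this toy example no single method proposes all three accepted labels, which mirrors conclusion (i) in spirit only; the figures reported in the abstract come from the authors' own evaluation, not from this sketch.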


Bibliographic details

Source: 6th Iberian Conference on Information Systems and Technologies, 2011, pp. 1-7 (7 pages)
Authors: Teixeira Jorge; Sarmento Luis; Oliveira Eugenio
Original format: PDF
Chinese Library Classification: TP391.41

Similar literature

1. Semi-Automatic Creation of Youth Slang Corpus and Its Application to Affective Computing [J]. Fuji Ren, Kazuyuki Matsumoto. Affective Computing, IEEE Transactions on. 2016, Issue 2
2. Semi-Automatic Bilingual Corpus Creation with Zero Entropy Alignments [J]. Algirdas Laukaitis, Olegas Vasilecas, Ricardas Laukaitis. Informatica. 2011, Issue 2
3. Invisible or high-risk: Computer-assisted discourse analysis of references to Aboriginal and Torres Strait Islander people(s) and issues in a newspaper corpus about diabetes [J]. Monika Bednarek. PLoS One. 2020, Issue 6
4. Semi-Automatic Creation of a Reference News Corpus for Fine-Grained Multi-Label Scenarios [C]. Jorge Teixeira, Luis Sarmento, Eugenio Oliveira. Iberian Conference on Information Systems and Technologies. 2011
5. Corpus linguistics, contextual collocation and ESP syllabus creation: A text analysis approach to the study of medical research articles [D]. Jabbour, Georgette N. 1998
6. Invisible or high-risk: Computer-assisted discourse analysis of references to Aboriginal and Torres Strait Islander people(s) and issues in a newspaper corpus about diabetes [O]. Monika Bednarek. 2020
7. Semi-Automatic Bilingual Corpus Creation with Zero Entropy Alignments [O]. Algirdas Laukaitis, Olegas Vasilecas, Ricardas Laukaitis. 2011
