Revista de Ingeniería Electrónica, Automática y Comunicaciones
Reducing Vector Space Dimensionality in Automatic Classification for Authorship Attribution


Abstract

For automatic classification, the implications of having too many classificatory features are twofold. On the one hand, features that do not help discriminate between classes should be removed from the classification. On the other hand, redundant features can produce negative effects as their number grows, and their detrimental impact should be minimized or limited. In text classification tasks, where word and word-derived features are commonly employed, the number of distinct features extracted from text samples can grow quickly. In the specific context of authorship attribution, several traditionally used features, such as n-grams or word sequences, can produce long lists of distinct features, the great majority of which have very few instances. Previous research has shown that in this task feature reduction can surpass the performance of noise-tolerant algorithms in addressing the issues associated with the abundance of classificatory features. However, there has been no attempt to demonstrate the motivation for this solution. This article shows how, even in the small data collections characteristically used in authorship attribution, the frequency rank of common elements remains stable as their instances accumulate, while novel, uncommon words are constantly found. Given this general vocabulary property, present even in very small text collections, the application of techniques to reduce vector space dimensionality is especially beneficial across the various experimental settings typical of this task. These implications may also be helpful for other automatic classification tasks with similar conditions.
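The vocabulary property described in the abstract can be illustrated with a short sketch. The code below is not from the article; the toy corpus and the cutoff `min_count` are illustrative assumptions. It simply shows how counting word features exposes the long tail of rare terms and how a frequency threshold reduces the dimensionality of the document vectors.

```python
# A minimal sketch (not the article's own code) of the two observations in the
# abstract: most word features occur only a handful of times, and keeping only
# the more frequent ones shrinks the vector space. The toy corpus and the
# threshold `min_count` are illustrative assumptions.
from collections import Counter

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks at the quick fox",
    "a lazy afternoon with the dog and the fox",
]

# Full vocabulary: one vector-space dimension per distinct word.
tokens = [w for doc in corpus for w in doc.split()]
freq = Counter(tokens)
hapax = [w for w, c in freq.items() if c == 1]
print(f"{len(freq)} distinct features, {len(hapax)} of them occur only once")

# Frequency-based reduction: keep words seen at least `min_count` times.
min_count = 2  # illustrative threshold, not a value taken from the article
vocabulary = sorted(w for w, c in freq.items() if c >= min_count)
index = {w: i for i, w in enumerate(vocabulary)}

def vectorize(doc):
    """Map a document onto the reduced vector space (raw term counts)."""
    vec = [0] * len(vocabulary)
    for w in doc.split():
        if w in index:
            vec[index[w]] += 1
    return vec

print("reduced dimensionality:", len(vocabulary))
print(vectorize(corpus[0]))
```

In practice the same idea is often applied through vectorizer options such as min_df or max_features in scikit-learn, although the article does not prescribe a specific tool.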
