Journal: Knowledge-Based Systems

Projected-prototype based classifier for text categorization

Abstract

The explosive growth of data is driving greater demand for text categorization. Existing prototype-based classifiers, including k-NN, kNNModel and the Centroid classifier, have attracted wide interest from the text mining community because of their simplicity and efficiency. However, they usually perform less effectively on document data sets because of the high dimensionality and complex class structures that these sets involve. In most cases a single document category actually contains multiple subtopics, so the documents in the same class may comprise multiple subclasses, each associated with its own term subspace. In this paper, a novel projected-prototype based classifier is proposed for text categorization, in which a document category is represented by a set of prototypes, each combining a representative for the documents in a subclass with its corresponding term subspace. In the classifier's training process, the number of prototypes and the prototypes themselves are learned using a newly developed feature-weighting algorithm, so that documents belonging to different subclasses are separated as much as possible when projected onto their own subspaces. Then, in the testing process, each test document is classified according to its weighted distances from the different prototypes. Experimental results on the Reuters-21578 and 20-Newsgroups corpora show that the proposed classifier, based on the multi-representative-dependent projection method, can achieve higher classification accuracy at a lower computational cost than conventional prototype-based classifiers, especially for data sets that include overlapping document categories.
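
The testing step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact distance formula or the training algorithm, so everything here is an assumption: the weighted squared Euclidean distance, the `classify` function name, the toy four-term vocabulary, and the hand-picked prototypes and weights (which the paper would instead learn with its feature-weighting algorithm).

```python
import numpy as np

def classify(doc, prototypes):
    """Return the label of the prototype with the smallest weighted distance.

    prototypes: list of (label, center, weights) tuples. A class may own
    several prototypes, one per subclass. All values are hypothetical.
    """
    best_label, best_dist = None, float("inf")
    for label, center, weights in prototypes:
        # Assumed distance: weighted squared Euclidean. Terms with zero
        # weight are ignored, so each prototype measures distance only
        # inside its own term subspace.
        dist = float(np.sum(weights * (doc - center) ** 2))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy vocabulary (assumed): [goal, match, stock, market]
prototypes = [
    ("sports",  np.array([1.0, 1.0, 0.0, 0.0]), np.array([0.5, 0.5, 0.0, 0.0])),
    # Two prototypes for one class, standing in for two subtopics.
    ("finance", np.array([0.0, 0.0, 1.0, 1.0]), np.array([0.0, 0.0, 0.5, 0.5])),
    ("finance", np.array([0.0, 1.0, 1.0, 0.0]), np.array([0.0, 0.4, 0.6, 0.0])),
]

print(classify(np.array([1.0, 1.0, 0.0, 0.0]), prototypes))
print(classify(np.array([0.0, 0.0, 1.0, 1.0]), prototypes))
```

Because the "finance" class has two prototypes with different weight vectors, a document can match the class through either subtopic. This is the multi-representative idea in miniature, not the authors' implementation.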