Journal: Expert Systems with Applications

Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering


Abstract

High dimensionality of the feature space is a major concern in text clustering because of its effect on computational complexity and accuracy. Various dimension reduction methods have therefore been introduced in the literature to select an informative subset (or sublist) of features. Because each method uses a different strategy (aspect) to select features, different methods produce different feature sublists for the same dataset. Hybrid approaches, which combine several aspects of feature relevance in a single feature subset selection, have consequently received considerable attention. Traditionally, the sublists selected by different methods are merged by union or intersection. The union approach keeps every feature in the considered sublists, which increases the total number of features; the intersection approach keeps only the features common to them, which loses some important features. To take advantage of one approach while lessening the drawbacks of the other, a novel integration approach called modified union is proposed. It applies union to the selected top-ranked features and intersection to the remaining feature sublists, so that both top-ranked and common features are selected without greatly increasing the dimension of the feature space. In this study, the feature selection methods term variance (TV) and document frequency (DF) are used to compute feature relevance scores. A feature extraction method, principal component analysis (PCA), is then applied to further reduce the dimension of the feature space without losing much information. The effectiveness of the proposed method is tested on three benchmark datasets: Reuters-21578, Classic4, and WebKB. The results are compared with TV, DF, and variants of the proposed hybrid dimension reduction method. The experimental studies demonstrate that the proposed method improves clustering accuracy compared with the competing methods.
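
The modified-union step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the scoring formulas or the cut-off values, so the code assumes the common definitions of TV (sum of squared deviations of a term's frequency from its mean over all documents) and DF (number of documents containing the term). The parameters `top_k` and `sublist_size`, the function names, and the toy matrix are hypothetical, and the way the "remaining" part of each sublist is delimited is an assumption. The subsequent PCA step is not shown.

```python
# Sketch of TV/DF ranking and a "modified union" merge (pure standard library).
# Assumed definitions; top_k and sublist_size are hypothetical parameters.

def term_variance(doc_term):
    """TV per term: sum over documents of (frequency - mean frequency)^2."""
    scores = []
    for j in range(len(doc_term[0])):
        column = [row[j] for row in doc_term]
        mu = sum(column) / len(column)
        scores.append(sum((x - mu) ** 2 for x in column))
    return scores

def document_frequency(doc_term):
    """DF per term: number of documents in which the term occurs."""
    return [sum(1 for row in doc_term if row[j] > 0)
            for j in range(len(doc_term[0]))]

def rank(scores):
    """Feature indices by descending score; ties broken by lower index."""
    return sorted(range(len(scores)), key=lambda j: (-scores[j], j))

def modified_union(rankings, top_k, sublist_size):
    """Union of each ranking's top_k features, plus the intersection of the
    rest of each sublist (ranks top_k up to sublist_size)."""
    top = set()
    for r in rankings:
        top.update(r[:top_k])
    rest = [set(r[top_k:sublist_size]) for r in rankings]
    common = set.intersection(*rest) if rest else set()
    return sorted(top | common)

# Toy document-term matrix: 4 documents (rows) x 6 terms (columns).
X = [
    [5, 1, 0, 2, 0, 1],
    [0, 1, 3, 2, 0, 0],
    [4, 1, 0, 0, 1, 0],
    [0, 1, 2, 3, 0, 0],
]

tv_rank = rank(term_variance(X))       # [0, 2, 3, 4, 5, 1]
df_rank = rank(document_frequency(X))  # [1, 3, 0, 2, 4, 5]

selected = modified_union([tv_rank, df_rank], top_k=1, sublist_size=4)
print(selected)  # [0, 1, 2, 3]
```

On this toy matrix, a plain union of the two size-4 sublists would keep five features and a plain intersection three, dropping term 1 even though it is ranked first by DF. The modified union keeps four: the top-ranked feature of each method plus the features common to the rest of both sublists.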
