Technical Gazette

Unsupervised Text Topic-Related Gene Extraction for Large Unbalanced Datasets

Abstract

Traditional unsupervised feature extraction algorithms commonly assume that the distribution of clusters in a dataset is balanced. However, when topic keywords are extracted from large, unlabeled, unbalanced text datasets, feature selection is guided by similarities calculated among features. As a result, the selected features cannot truly reflect the information in the original dataset, which degrades the performance of subsequent classifiers. To solve this problem, this paper proposes a new method for the unsupervised extraction of text topic-related genes. First, sample clusters are obtained by factor analysis and a density peak algorithm, and the dataset is labeled on that basis. Then, to account for the influence of the unbalanced distribution of sample clusters on feature selection, a CHI statistical matrix feature selection method that combines average local density with information entropy is used to strengthen the features of low-density, small-sample clusters. Finally, a related-gene extraction method based on exploring high-order relevance in multidimensional statistical data is described; it uses independent component analysis to enhance the generalisability of the selected features. In this way, unsupervised text topic-related genes can be extracted from large unbalanced datasets. Experimental results suggest that the proposed method outperforms existing methods in extracting text subject terms from low-density, small-sample clusters, and has higher prematurity and feature dimension-reduction ability.
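The first two stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the code below shows only the generic building blocks, namely the local density and separation distance used by density peak clustering, and the standard CHI statistic between a term and a cluster label. The function names `density_peaks` and `chi_square`, the cutoff distance `d_c`, and the toy data are all assumptions made for illustration. The factor analysis step, the weighting by average local density and information entropy, and the independent component analysis step are omitted.

```python
import math


def density_peaks(points, d_c):
    """Return (rho, delta) for each point (hypothetical helper).

    rho[i]   = number of other points closer than the cutoff d_c
    delta[i] = distance to the nearest point with a higher rho
    Points with both large rho and large delta are cluster-centre candidates.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # The densest points have no denser neighbour; fall back to the
        # largest distance from this point to any other point.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def chi_square(docs, labels, term, cluster):
    """Standard CHI statistic from the 2x2 term/cluster document table.

    docs   : list of token sets, one per document
    labels : cluster label per document (e.g. from the clustering stage)
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cluster = label == cluster
        if has_term and in_cluster:
            a += 1          # term present, document in cluster
        elif has_term:
            b += 1          # term present, document outside cluster
        elif in_cluster:
            c += 1          # term absent, document in cluster
        else:
            d += 1          # term absent, document outside cluster
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Toy data (assumed): one dense group of three points, one sparse pair.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
rho, delta = density_peaks(points, d_c=2.0)

# Toy corpus (assumed), with labels standing in for the clustering output.
docs = [{"gene", "cell"}, {"gene"}, {"market"}, {"market", "stock"}]
labels = [0, 0, 1, 1]
score = chi_square(docs, labels, "gene", 0)
```

In this toy setting the sparse pair receives a lower `rho` than the dense group, which is the kind of low-density, small-sample cluster the abstract says the proposed weighting is meant to strengthen.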
