Expert Systems with Applications

An enhanced ACO algorithm to select features for text categorization and its parallelization


Abstract

Feature selection is an indispensable preprocessing step for the effective analysis of high-dimensional data. It removes irrelevant features, improves predictive accuracy, and increases the comprehensibility of the models built by classifiers that are sensitive to the feature set. Finding an optimal feature subset in a very large domain is intractable, and many such feature selection problems have been shown to be NP-hard. Optimization algorithms are therefore often designed for NP-hard problems to find near-optimal solutions in practical time. This paper formulates text feature selection as a combinatorial problem and proposes an Ant Colony Optimization (ACO) algorithm to find a near-optimal solution. It differs from the earlier algorithm by Aghdam et al. in that it includes a statistics-based heuristic function and a local search. The algorithm aims to determine a solution that contains 'n' distinct features for each category. Optimization algorithms based on wrapper models give better results, but the processes involved are time-intensive. The availability of parallel architectures, such as a cluster of machines connected through fast Ethernet, has increased interest in parallelizing such algorithms. The proposed ACO algorithm was parallelized and demonstrated on a cluster of up to six machines. Documents from the 20 Newsgroups benchmark dataset were used for experimentation. Features selected by the proposed algorithm were evaluated with a Naive Bayes classifier and compared with standard feature selection techniques. The classifier's performance improved with the features selected by the enhanced ACO and local search. The classifier's error decreased over iterations, and the number of positive features was observed to increase with the number of iterations.
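
The abstract does not spell out the algorithm's update rules, so the following is a minimal, hypothetical Python sketch of ACO-based feature selection in the spirit described: a chi-squared score stands in for the statistics-based heuristic, a single-swap local search refines each ant's subset, and a Naive Bayes classifier scores candidates, as in the paper's evaluation. The function names and parameters (aco_select, n_ants, alpha, beta, rho, etc.) and the choice of chi-squared are illustrative assumptions, not the authors' implementation; X is assumed to be a nonnegative document-term count matrix and y the category labels.

```python
# Hypothetical sketch of ACO feature selection with a statistics-based heuristic,
# a simple local search, and Naive Bayes wrapper evaluation (assumptions throughout).
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def evaluate(X, y, subset):
    """Score a candidate feature subset by Naive Bayes cross-validated accuracy."""
    return cross_val_score(MultinomialNB(), X[:, list(subset)], y, cv=3).mean()

def local_search(X, y, subset, n_total, score, tries=5, rng=None):
    """Try a few single-feature swaps; keep a swap only if it improves the score."""
    rng = rng or np.random.default_rng()
    subset = list(subset)
    for _ in range(tries):
        out_idx = rng.integers(len(subset))
        candidate = rng.choice(list(set(range(n_total)) - set(subset)))
        trial = subset.copy()
        trial[out_idx] = candidate
        trial_score = evaluate(X, y, trial)
        if trial_score > score:
            subset, score = trial, trial_score
    return subset, score

def aco_select(X, y, n_features=50, n_ants=10, n_iter=20,
               alpha=1.0, beta=1.0, rho=0.2, seed=0):
    """Return the best feature subset found and its cross-validated accuracy."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    eta, _ = chi2(X, y)                      # statistics-based heuristic (assumed: chi-squared)
    eta = np.nan_to_num(eta) + 1e-9
    tau = np.ones(d)                         # uniform initial pheromone
    best_subset, best_score = None, -np.inf
    for _ in range(n_iter):
        for _ in range(n_ants):
            prob = (tau ** alpha) * (eta ** beta)
            prob /= prob.sum()
            subset = rng.choice(d, size=n_features, replace=False, p=prob)
            score = evaluate(X, y, subset)
            subset, score = local_search(X, y, subset, d, score, rng=rng)
            if score > best_score:
                best_subset, best_score = list(subset), score
        tau *= (1 - rho)                     # pheromone evaporation
        tau[best_subset] += best_score       # reinforce the best subset found so far
    return best_subset, best_score
```

A call such as aco_select(X_train, y_train, n_features=100) would return the selected feature indices and their wrapper score; the per-category selection of 'n' features mentioned in the abstract is omitted here for brevity.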
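The paper parallelizes the ACO across a cluster of up to six machines connected by fast Ethernet, but the abstract gives no implementation details. As a simplified single-machine analogue, and explicitly not the paper's cluster setup, the independent ant evaluations within one iteration can be scored concurrently with a process pool; every name and parameter below is an illustrative assumption.

```python
# Hypothetical sketch: score one ACO iteration's ant subsets concurrently.
# A local process pool stands in for the paper's cluster of machines.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def score_subset(X, y, subset):
    """Naive Bayes cross-validated accuracy of one candidate feature subset."""
    return cross_val_score(MultinomialNB(), X[:, list(subset)], y, cv=3).mean()

def run_ants_parallel(X, y, tau, eta, n_ants=10, n_features=50,
                      alpha=1.0, beta=1.0, workers=4, seed=0):
    """Build n_ants feature subsets and score them in parallel; returns (subset, score) pairs."""
    rng = np.random.default_rng(seed)
    prob = (tau ** alpha) * (eta ** beta)
    prob /= prob.sum()
    subsets = [rng.choice(X.shape[1], size=n_features, replace=False, p=prob)
               for _ in range(n_ants)]
    # score_subset must live at module level so worker processes can pickle it.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score_subset, [X] * n_ants, [y] * n_ants, subsets))
    return list(zip(subsets, scores))
```

After all ants of an iteration return, the pheromone update can be applied sequentially on the coordinating process, which mirrors the synchronization point a distributed implementation would also need.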
