Feature selection is a crucial activity when knowledge discovery is applied to very large databases, as it reduces dimensionality and therefore the complexity of the problem. Its main objective is to eliminate attributes to obtain a computationally tractable problem without affecting the quality of the solution. Several feature selection methods have been proposed, some of them tested on small academic datasets. In this paper we evaluate different feature selection-ranking methods on a very large real-world database from a Mexican electric energy client-invoice system. Most research on feature selection methods evaluates only accuracy and processing time; here we also report on the amount of discovered knowledge and stress the issue of the boundary that separates relevant from irrelevant features. The evaluation was done using the Elvira and Weka tools, which integrate and implement state-of-the-art data mining algorithms. Finally, we propose a promising feature selection heuristic based on the experiments performed.