This thesis concentrates on the theory, implementation, and application of dimension reduction in data mining. Many real-world applications, such as text mining, image retrieval, face recognition, and microarray data analysis, involve high-dimensional data, where the dimension can run into the thousands. Traditional machine learning and data mining techniques are not effective on such high-dimensional data because of the so-called curse of dimensionality. A natural approach to this problem is to apply dimension reduction as a pre-processing step.

The first part of the thesis presents a dimension reduction technique for data in matrix form. The essence of the proposed algorithm is that it applies a bilinear transformation to the data. Such a transformation is particularly appropriate for data in matrix representation and often leads to lower computational costs than traditional algorithms. A natural application is image compression and retrieval, where each image is kept in its native matrix representation. Extensive experiments on image data show that the proposed algorithm outperforms traditional ones in computational time and space requirements, while maintaining competitive classification performance.

The second part of the thesis focuses on generalizing classical Linear Discriminant Analysis (LDA) to overcome problems associated with undersampled data, where the data dimension is much greater than the number of data items. The optimization criterion in classical LDA fails when the scatter matrices are singular, which is the case for undersampled problems. A new optimization criterion has been developed that remains applicable to undersampled problems.
The algorithms based on the proposed criterion have been shown to be very competitive in classification.

The final part of the thesis considers the problem of designing an efficient, incremental dimension reduction algorithm. An LDA-based incremental dimension reduction algorithm has been developed. Unlike other LDA-based algorithms, it does not require the whole data matrix to reside in main memory, which is desirable for large datasets. More importantly, as new data items are inserted, the proposed algorithm contains the computational cost through efficient incremental updating techniques. Experiments show that the proposed algorithm is competitive in classification while incurring a much lower computational cost, especially when new data items are inserted dynamically.
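To make the incremental idea concrete, one common way to avoid keeping the whole data matrix in memory is to maintain sufficient statistics (per-class counts, per-class sums, and a running sum of outer products) from which the scatter matrices can be recovered after each insertion. The sketch below illustrates this general scheme only; the class and method names are illustrative, and it uses a pseudoinverse-based criterion as a stand-in for the thesis's actual generalized criterion.

```python
import numpy as np

class IncrementalLDA:
    """Sketch of LDA with incrementally updated scatter statistics.

    The whole data matrix is never stored: only per-class counts,
    per-class sums, and a running sum of outer products are kept.
    This is a generic illustration, not the thesis's algorithm.
    """

    def __init__(self, dim):
        self.dim = dim
        self.n = 0
        self.total_sum = np.zeros(dim)
        self.outer_sum = np.zeros((dim, dim))  # sum of x x^T over all items
        self.class_n = {}                      # label -> item count
        self.class_sum = {}                    # label -> sum of items

    def insert(self, x, label):
        """O(d^2) update per new item; no data matrix is retained."""
        self.n += 1
        self.total_sum += x
        self.outer_sum += np.outer(x, x)
        self.class_n[label] = self.class_n.get(label, 0) + 1
        self.class_sum[label] = self.class_sum.get(label, np.zeros(self.dim)) + x

    def transform_matrix(self, k):
        """Recover S_b and S_w from the statistics and return the top-k
        discriminant directions. The pseudoinverse keeps the criterion
        defined even when S_w is singular (the undersampled case)."""
        mean = self.total_sum / self.n
        Sb = np.zeros((self.dim, self.dim))
        for c, nc in self.class_n.items():
            d = self.class_sum[c] / nc - mean
            Sb += nc * np.outer(d, d)
        # Total scatter: St = sum x x^T - n * mean mean^T, and Sw = St - Sb.
        St = self.outer_sum - self.n * np.outer(mean, mean)
        Sw = St - Sb
        evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
        order = np.argsort(-evals.real)[:k]
        return evecs[:, order].real
```

After each insertion only the statistics change, so the per-item cost depends on the dimension rather than on the number of items seen so far; the (more expensive) eigendecomposition is deferred until a transformation is actually requested.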