Classification of Stellar Spectra Using Twin Support Vector Machine with Unlabeled Data

刘忠宝; 雷宇飞; 宋文爱; 张静; 王杰; 屠良平

Abstract

Stellar spectra classification is a long-standing focus of astronomical techniques and methods. As observation instruments keep operating and improving, the number of spectra obtained grows daily, and this volume makes manual processing very challenging. Researchers have therefore turned to data mining algorithms to process these spectra. In recent years, neural networks, self-organizing maps, association rules, and other data mining methods have been widely applied to stellar spectra classification. Among these methods, the Support Vector Machine (SVM) is highly regarded for its strong learning capability and efficient classification performance. The basic idea of SVM is to find an optimal separating hyperplane between two classes of samples. Its optimization problem is posed as a convex quadratic programming (QP) problem, so the solution obtained is a global optimum. Although SVM performs well in practice, the Twin Support Vector Machine (TSVM) was proposed to further improve its classification ability. TSVM constructs two non-parallel hyperplanes such that each plane lies close to one class and as far as possible from the other. TSVM trains approximately four times faster than the classical SVM; it has therefore attracted sustained attention since its introduction, and a number of improved variants have appeared in the literature.

In stellar spectra classification, classifiers are generally built from historical observed spectra, and the key step is labeling the spectra manually, which is extremely tedious and error-prone. It is therefore particularly important to build the classification model from labeled spectra together with some unlabeled spectra. To this end, this paper proposes the Twin Support Vector Machine with Unlabeled Data (TSVMUD) for the intelligent classification of stellar spectra. The method first divides the spectra into a training dataset and a test dataset; it then learns on the training set to obtain a classification model; finally, it uses that model to classify the spectra in the test set for verification. TSVMUD inherits the advantages of TSVM, including its low computational complexity; more importantly, when learning the model on the training set it considers not only the labeled training samples but also some unlabeled samples. This improves learning efficiency on the one hand and yields a better classification model on the other. Comparative experiments on the SDSS DR8 stellar spectra dataset show that TSVMUD classifies better than traditional methods such as SVM, TSVM, and K Nearest Neighbor (KNN). The method nevertheless has limitations, a major one being that it cannot handle massive volumes of spectral data. Drawing on the idea of random sampling of massive data and on big data processing technologies, future work will study the adaptability of the proposed method in a big data environment.
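The split, train, and verify workflow with two non-parallel planes can be illustrated with a small sketch. This is not the TSVMUD method: the abstract does not specify how unlabeled spectra enter the objective, so that part is omitted. The sketch also swaps the QP-based TSVM for its least-squares variant, which has a closed-form solution and needs no solver. The synthetic 2-D points stand in for spectral features, and all names (`fit_twin_planes`, `predict`, `make_blob`) and parameter values are assumptions made for illustration.

```python
import random


def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def gram(rows):
    """Return R^T R for a list of row vectors R."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]


def col_sums(rows):
    """Return R^T e, where e is a vector of ones (the column sums)."""
    return [sum(r[i] for r in rows) for i in range(len(rows[0]))]


def fit_twin_planes(A, B, c=1.0, eps=1e-8):
    """Least-squares twin SVM (assumed stand-in, labeled data only).

    Returns z1 = [w1..., b1], a plane close to class A, and
    z2 = [w2..., b2], a plane close to class B.
    """
    E = [list(x) + [1.0] for x in A]  # class A rows with a bias column
    F = [list(x) + [1.0] for x in B]  # class B rows with a bias column
    EtE, FtF = gram(E), gram(F)
    n = len(EtE)
    # Plane 1: z1 = -(F^T F + E^T E / c)^-1 F^T e; eps keeps the system well-posed
    M1 = [[FtF[i][j] + EtE[i][j] / c + (eps if i == j else 0.0)
           for j in range(n)] for i in range(n)]
    z1 = [-t for t in solve(M1, col_sums(F))]
    # Plane 2: z2 = (E^T E + F^T F / c)^-1 E^T e
    M2 = [[EtE[i][j] + FtF[i][j] / c + (eps if i == j else 0.0)
           for j in range(n)] for i in range(n)]
    z2 = solve(M2, col_sums(E))
    return z1, z2


def distance(z, x):
    """Perpendicular distance from point x to the plane z = [w..., b]."""
    w, b = z[:-1], z[-1]
    norm = sum(wi * wi for wi in w) ** 0.5
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def predict(z1, z2, x):
    """Assign x to the class whose plane is nearer (+1 for A, -1 for B)."""
    return 1 if distance(z1, x) <= distance(z2, x) else -1


def make_blob(center, n, rng):
    """Draw n synthetic 2-D points around center with unit spread."""
    return [[rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)]
            for _ in range(n)]


rng = random.Random(0)
pos = make_blob((0.0, 0.0), 200, rng)
neg = make_blob((4.0, 4.0), 200, rng)

# Step 1: divide the data into a training part and a test part.
train_pos, test_pos = pos[:150], pos[150:]
train_neg, test_neg = neg[:150], neg[150:]

# Step 2: learn the two planes on the training part.
z1, z2 = fit_twin_planes(train_pos, train_neg)

# Step 3: verify the model on the held-out test part.
test_set = [(x, 1) for x in test_pos] + [(x, -1) for x in test_neg]
accuracy = sum(predict(z1, z2, x) == y for x, y in test_set) / len(test_set)
print(f"held-out accuracy: {accuracy:.3f}")
```

Each plane is fitted from a small linear system whose size depends on the feature dimension rather than on the number of samples, which is one reason twin formulations are cheap to train. Real spectra would need feature extraction first, a step this sketch does not cover.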
