Journal: Current Science: A Fortnightly Journal of Research

Classification of SDSS photometric data using machine learning on a cloud



Abstract

Astronomical datasets are typically very large, and classifying their contents manually is effectively impossible. We use machine learning algorithms to classify (as stars, quasars or galaxies) more than one billion objects observed photometrically in the Third Data Release of the Sloan Digital Sky Survey (SDSS-III). We ran kNN, SVM and random forest algorithms in a distributed environment on the cloud to classify 1,183,850,913 unclassified photometric objects in the SDSS-III catalog. This catalog contains photometric data for all objects observed by the telescope and spectroscopic data for a small fraction of them. Although every object could in principle be classified from spectroscopic data, obtaining such data for each one is impractical. Classifying a dataset this large on a single machine would be impractically slow, so we used the Spark cluster computing framework to build a distributed computing environment on the cloud. We found that writing the results (dozens of gigabytes) to cloud storage is very slow with kNN. Writing the results with SVM is faster because it is done in parallel, but its accuracy is only around 87%, because Spark lacks a kernel SVM implementation. We then used the random forest algorithm to classify the entire set of 1,183,850,913 objects with an accuracy of 94% in about 17 hours of processing time. The result set is significant because even collecting spectroscopic data for this many objects would take decades, and our classifications can help astronomers and astrophysicists carry out further studies.
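As a rough illustration of the kNN step mentioned in the abstract, the sketch below labels an object by majority vote among its nearest spectroscopically labeled neighbors. It is not the authors' code: the abstract does not state which features were used, so the two color indices (u-g, g-r), the training values, and the name `knn_classify` are all assumptions made for this example, and it runs on a single machine rather than on Spark.

```python
import math
from collections import Counter

# Toy training set: (u-g, g-r) colour indices paired with a class label.
# These numbers are invented for illustration only; they are NOT SDSS data,
# and the choice of colour indices as features is an assumption.
TRAIN = [
    ((1.2, 0.5), "star"),
    ((1.4, 0.6), "star"),
    ((1.1, 0.4), "star"),
    ((0.2, 0.1), "quasar"),
    ((0.3, 0.2), "quasar"),
    ((0.1, 0.0), "quasar"),
    ((1.8, 0.9), "galaxy"),
    ((1.9, 1.0), "galaxy"),
    ((2.0, 0.8), "galaxy"),
]


def knn_classify(features, train=TRAIN, k=3):
    """Return the majority label among the k nearest training points."""
    # Sort labelled points by Euclidean distance to the query and keep k.
    nearest = sorted(train, key=lambda item: math.dist(features, item[0]))[:k]
    # Count the labels of those k neighbours and pick the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Classify one hypothetical unlabelled object.
    print(knn_classify((0.25, 0.15)))
```

Each prediction depends only on the labeled training set and the one object being classified, which is what lets work of this kind be split across the machines of a cluster.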
