Semi-supervised classification of stars, galaxies and quasars using K-means and random-forest approaches

Vahid Asadi, Hosein Haghi, Akram Hasani Zonoozi

Published: 2025/7/18

Abstract

Classifying stars, galaxies, and quasars is essential for understanding cosmic structure and evolution; however, the vast data volumes from modern surveys make manual classification impractical, while supervised learning methods remain constrained by the scarcity of labeled spectroscopic data. We aim to develop a scalable, label-efficient method for astronomical classification by leveraging semi-supervised learning (SSL) to overcome the limitations of fully supervised approaches. We propose a novel SSL framework combining K-means clustering with random forest classification. Our method partitions unlabeled data into 50 clusters, propagates labels from spectroscopically confirmed centroids to 95% of cluster members, and trains a random forest on the expanded pseudo-labeled dataset. We applied this to the CPz catalog, which contains multi-survey photometric and spectroscopic data, and compared its performance with that of a fully supervised random forest. Our SSL approach achieves F1 scores of 98.8%, 98.9%, and 92.0% for stars, galaxies, and quasars, respectively, closely matching the supervised method's F1 scores of 99.1%, 99.1%, and 93.1%, while outperforming traditional color-cut techniques. The method demonstrates robustness in high-dimensional feature spaces and superior label efficiency compared to prior work. This work highlights SSL as a scalable solution for astronomical classification when labeled data are limited, though performance may degrade in lower-dimensional settings.
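The pipeline described in the abstract (cluster, propagate labels, train a random forest) can be illustrated with a minimal Python sketch using scikit-learn. This is not the authors' code. It makes several assumptions that the abstract does not confirm: the data are synthetic blobs rather than the CPz catalog, the "spectroscopically confirmed centroid" is taken to be the cluster member nearest to each K-means centroid, and "95% of cluster members" is taken to mean the 95% of members closest to that centroid. The random-forest settings are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for photometric features (assumption: NOT the CPz catalog).
# Classes 0, 1, 2 play the role of star, galaxy, quasar.
X, y = make_blobs(n_samples=3000, centers=3, n_features=8,
                  cluster_std=2.0, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

N_CLUSTERS = 50        # number of clusters quoted in the abstract
KEEP_FRACTION = 0.95   # fraction of members that inherit the label

# Step 1: partition the (treated-as-unlabeled) pool into clusters.
km = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0).fit(X_pool)
assigned = km.labels_
# Distance from every object to the centroid of its own cluster.
d_own = km.transform(X_pool)[np.arange(len(X_pool)), assigned]

# Step 2: look up one "spectroscopic" label per cluster and propagate it.
pseudo = np.full(len(X_pool), -1)   # -1 marks objects left unlabeled
n_queries = 0
for c in range(N_CLUSTERS):
    members = np.flatnonzero(assigned == c)
    if members.size == 0:
        continue
    # Assumed interpretation: the member nearest the centroid is the
    # object whose true (spectroscopic) label we are allowed to read.
    nearest = members[np.argmin(d_own[members])]
    n_queries += 1
    # Assumed interpretation: only the closest 95% of members inherit it.
    cutoff = np.quantile(d_own[members], KEEP_FRACTION)
    pseudo[members[d_own[members] <= cutoff]] = y_pool[nearest]

# Step 3: train a random forest on the pseudo-labeled objects only.
mask = pseudo >= 0
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_pool[mask], pseudo[mask])

# Per-class F1 on held-out objects with known labels.
f1 = f1_score(y_test, rf.predict(X_test), average=None)
print("true labels used:", n_queries)
print("pseudo-labeled fraction:", round(float(mask.mean()), 3))
print("per-class F1:", np.round(f1, 3))
```

In this sketch, at most 50 true labels are read (one per cluster), which is the sense in which such a scheme is label-efficient. Scores on well-separated synthetic blobs say nothing about the F1 values reported for the real catalog.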
