Enhancing CTAO Monitoring and Alarm Subsystems in Distributed Environments Using ServiMon
Kevin Munari, Alessandro Costa, Federico Incardona, Emilio Mastriani, Sebastiano Spinello, Stefano Germani, Pietro Bruno
Published: 2025/9/19
Abstract
ServiMon is a scalable data collection and auditing pipeline designed for service-oriented, cost-efficient quality control in distributed environments, including the monitoring, logging, and alarm subsystems of the Cherenkov Telescope Array Observatory (CTAO). Built on a Docker-based architecture, it uses cloud-native technologies and distributed computing principles to improve system observability and reliability. At its core, ServiMon integrates Prometheus, Grafana, Kafka, and Cassandra. Prometheus serves as the primary engine for collecting real-time performance metrics across multiple nodes. Grafana provides interactive, service-oriented data visualization to support system performance analysis. Kafka and Cassandra expose their system metrics via the JMX Exporter, offering insight into infrastructure availability and performance. This contribution shows how ServiMon can improve scalability, security, and efficiency in a distributed computing environment such as the CTAO monitoring, logging, and alarm subsystems. The integrated approach not only ensures robust real-time monitoring but also optimizes operational costs. Furthermore, the large volumes of diverse data that ServiMon generates over time provide a strong foundation for predictive maintenance. By incorporating stochastic and approximate computing techniques, it enables proactive failure detection and system optimization, minimizing downtime and maximizing telescope availability.
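As an illustration of the metric flow described in the abstract, the following minimal sketch parses a small sample of the Prometheus text exposition format (the format a JMX Exporter endpoint serves) and applies two simple alarm rules. It is not ServiMon's implementation: the metric names, the `service` label, the sample values, and the heap threshold are all assumptions chosen for the example, and the parser is deliberately simplified.

```python
import re

# Hypothetical sample in the Prometheus text exposition format.
# Metric names, labels, and values are illustrative assumptions only.
SAMPLE = """\
# HELP jvm_memory_used_bytes Used JVM memory in bytes.
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{service="kafka",area="heap"} 7.5e8
jvm_memory_used_bytes{service="cassandra",area="heap"} 1.9e9
service_up{service="kafka"} 1
service_up{service="cassandra"} 0
"""

# name, optional {labels}, value, optional integer timestamp (ignored).
LINE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# Simplification: label values containing escaped quotes are not handled.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def find_alarms(samples, heap_limit_bytes=1.5e9):
    """Apply two hypothetical alarm rules: service down, heap above a limit."""
    alarms = []
    for name, labels, value in samples:
        service = labels.get("service", "unknown")
        if name == "service_up" and value == 0:
            alarms.append(f"{service}: down")
        elif name == "jvm_memory_used_bytes" and value > heap_limit_bytes:
            alarms.append(f"{service}: heap above limit")
    return alarms


if __name__ == "__main__":
    for alarm in find_alarms(parse_metrics(SAMPLE)):
        print(alarm)
```

In a deployment of the kind the abstract describes, Prometheus itself would scrape and store these samples, and alarm conditions would more likely be expressed as Prometheus alerting rules; the sketch only makes the data format and the threshold idea concrete.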