Summary
A data center infrastructure is composed of heterogeneous resources divided into three main subsystems: IT (processors, memory, disks, network, etc.), power (generators, power transformers, uninterruptible power supplies, and distribution units, among others), and cooling (water chillers, pipes, and cooling towers). This heterogeneity makes it challenging to collect data from the many devices in the infrastructure, and extracting relevant information from those data is a further challenge for data center managers. Improving cloud availability requires monitoring the entire infrastructure with a variety of advanced open source and/or commercial monitoring tools, such as Zabbix, Nagios, Prometheus, CloudWatch, and AzureWatch. It is common to use several monitoring systems to collect real-time data on data center components from the different subsystems. Such an environment brings an inherent challenge: all the collected infrastructure data and measurements must be aggregated and organized before any valuable insights for decision-making can be obtained. In this paper, we present the Data Center Availability (DCA) System, a software system that aggregates and analyzes data center measurements for the study of DCA. We also discuss the DCA System implementation and illustrate its operation by monitoring a small university research laboratory data center. The DCA System monitors different types of devices, such as servers, switches, and power devices, using the Zabbix tool, and automatically identifies the seasonality and trend of failure times present in the data collected from the different devices of the data center.