Modern applications have become increasingly complex and their manual installation and configuration is
no longer practical. Instead, IT organizations heavily rely on Infrastructure as Code (IaC) technologies, to automate the provisioning, configuration, and maintenance of computing infrastructures and systems. IaC systems typically offer declarative, domain-specific languages (DSLs) that allow system administrators and developers to write high-level programs that specify the desired state of their infrastructure in a reliable, predictable, and documented fashion. Just like traditional programs, IaC software is not immune to faults, with issues ranging from deployment failures to critical misconfigurations that often impact production systems used by millions of end users. Surprisingly, despite its crucial role in global infrastructure management, the tooling and techniques for ensuring IaC reliability still have room for improvement.
In this work, we conduct a comprehensive analysis of 360 bugs identified in IaC software within prominent IaC ecosystems including Ansible, Puppet, and Chef. Our work is the first in-depth exploration of bug characteristics in these widely-used IaC environments. Through our analysis we aim to understand: (1) how these bugs manifest, (2) their underlying root causes, (3) their reproduction requirements in terms of system state (e.g., operating system versions) or input characteristics, and (4) how these bugs are fixed. Based on our findings, we evaluate the state-of-the-art techniques for IaC reliability, identify their limitations, and provide a set of recommendations for future research. We believe that our study helps researchers to (1) better understand the complexity and peculiarities of IaC software, and (2) develop advanced tooling for more reliable and robust system configurations.