Missing data (MD) is a prevalent problem and can negatively affect the trustworthiness of data analysis. In industrial use cases, faulty sensors or errors during data integration are common causes for systematically missing values. The majority of MD research deals with imputation, i.e., the replacement of missing values with "best guesses". Most imputation methods require missing values to occur independently, which is rarely the case in industry. Thus, it is necessary to identify missing data patterns (i.e., systematically missing values) prior to imputation (1) to understand the cause of the missingness, (2) to gain deeper insight into the data, and (3) to choose the proper imputation technique. However, in literature, there is a wide varity of MD patterns without a common formalization. In this paper, we introduce the first formal definition of MD patterns. Building on this theory, we developed a systematic approach on how to automatically detect MD patterns in industrial data. The approach has been developed in cooperation with voestalpine Stahl GmbH, where we applied it to real-world data from the steel industry and demonstrated its efficacy with a simulation study.
CCS CONCEPTS• Information systems → Data cleaning; • Mathematics of computing → Multivariate statistics.