Objective
Systematic approaches for understanding and assessing data quality are becoming increasingly important as the volume and utilization of health data steadily increase. In this study, a taxonomy of data defects was developed and used to automatically detect defects and thereby assess the quality of Medicaid data maintained by one state in the United States.
Materials and Methods
The Medicaid data examined contained more than 2.23 million rows and 32 million cells. The taxonomy was developed through document review, descriptive data analysis, and literature review. A software program was created to automatically detect defects using a set of constraints whose development was guided by the taxonomy.
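To illustrate what constraint-based defect detection can look like, the following is a minimal sketch and not the software used in the study. The column names (`member_id`, `service_date`, `gender`, `age`), the allowed gender codes, the age bounds, and the pairing of each constraint with a defect type are all hypothetical; only the defect-type labels themselves are taken from the text.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


@dataclass
class Constraint:
    """A rule a row must satisfy; a violation is recorded as one defect."""
    name: str                      # hypothetical human-readable rule name
    defect_type: str               # label borrowed from the taxonomy terms
    check: Callable[[Row], bool]   # returns True when the row passes


def has_member_id(row: Row) -> bool:
    return row.get("member_id", "").strip() != ""


def date_is_iso(row: Row) -> bool:
    # Assumed expected format: YYYY-MM-DD
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", row.get("service_date", "")) is not None


def gender_code_valid(row: Row) -> bool:
    # Assumed code set, for illustration only
    return row.get("gender", "") in {"F", "M", "U"}


def age_plausible(row: Row) -> bool:
    # Assumed plausibility bounds, for illustration only
    age = row.get("age", "")
    return age.isdigit() and 0 <= int(age) <= 120


CONSTRAINTS: List[Constraint] = [
    Constraint("member_id is present", "missingness", has_member_id),
    Constraint("service_date is YYYY-MM-DD", "format mismatch", date_is_iso),
    Constraint("gender is an allowed code", "invalid code", gender_code_valid),
    Constraint("age is within 0-120", "implausible value", age_plausible),
]


def detect_defects(rows: List[Row],
                   constraints: List[Constraint]) -> List[Tuple[int, str, str]]:
    """Return (row index, defect type, constraint name) for every violation."""
    defects = []
    for index, row in enumerate(rows):
        for constraint in constraints:
            if not constraint.check(row):
                defects.append((index, constraint.defect_type, constraint.name))
    return defects


if __name__ == "__main__":
    sample_rows = [
        {"member_id": "A1", "service_date": "2019-03-04", "gender": "F", "age": "34"},
        {"member_id": "", "service_date": "03/04/2019", "gender": "X", "age": "212"},
    ]
    for defect in detect_defects(sample_rows, CONSTRAINTS):
        print(defect)
```

Keeping each constraint as a small named function tagged with a defect type makes it straightforward to tally detected defects by taxonomy category afterwards.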
Results
Five major categories and seventeen subcategories of defects were identified. The major categories are missingness, incorrectness, syntax violation, semantic violation, and duplicity. More than 3 million defects were detected, indicating substantial problems with data quality. Defect density exceeded 10% in five tables. The majority of the data defects were of four types: format mismatch, invalid code, dependency-contract violation, and implausible value. Such contextual knowledge can support prioritized quality improvement initiatives for the Medicaid data studied.
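As a rough sketch of how a per-table defect density could be computed and compared against the 10% level mentioned above, the following assumes that density is the number of detected defects divided by the number of cells in a table. That denominator is an assumption, as it is not defined here. The table names and counts are made up for illustration and are not results from the study.

```python
from typing import Dict, List, Tuple


def defect_density(defect_count: int, cell_count: int) -> float:
    """Assumed definition: detected defects divided by cells in the table."""
    if cell_count <= 0:
        raise ValueError("cell_count must be positive")
    return defect_count / cell_count


def tables_over_threshold(stats: Dict[str, Tuple[int, int]],
                          threshold: float = 0.10) -> List[str]:
    """Return the names of tables whose density exceeds the threshold, sorted."""
    return sorted(
        name
        for name, (defects, cells) in stats.items()
        if defect_density(defects, cells) > threshold
    )


if __name__ == "__main__":
    # Hypothetical (defects, cells) pairs per table, for illustration only
    example_stats = {
        "claims": (150, 1000),
        "providers": (20, 1000),
    }
    for table, (defects, cells) in example_stats.items():
        print(f"{table}: {defect_density(defects, cells):.1%}")
    print(tables_over_threshold(example_stats))
```

Ranking tables this way is one simple means of deciding where quality improvement effort might be directed first.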
Conclusions
This research took initial steps toward understanding the types of data defects and detecting them in large healthcare datasets. The results suggest that healthcare organizations can potentially benefit from focusing on data quality improvement. The taxonomy developed and the approach followed in this study can be adopted for that purpose.