Objective: Real-world data, including administrative claims and electronic health record (EHR) data, have been critical for rapid knowledge generation throughout the COVID-19 pandemic. Many studies relied on these data to identify cases and ascertain outcomes, commonly using diagnostic codes. However, to ensure that high-quality results are delivered to guide clinical decision making, inform the public health response, and characterize the response to interventions, it is essential to establish the accuracy of these approaches for identifying infections and hospitalizations.

Methods: Real-world EHR data were obtained from the clinical data warehouse and computational health platform at a large academic health system that includes 5 regional hospitals in Connecticut and Rhode Island and their associated ambulatory practices. Demographic information, diagnosis codes, SARS-CoV-2 nucleic acid and antigen testing results, and visit data including discharge disposition were obtained from our OMOP common data model for all patients with either a positive SARS-CoV-2 test or an ICD-10 diagnosis of COVID-19 (U07.1) between April 1, 2020 and March 1, 2021. Various computable phenotype definitions using combinations of test results and diagnostic codes were evaluated for their accuracy in identifying SARS-CoV-2 infection and COVID-19 hospitalizations. The phenotypes were further compared with respect to case volumes and, for hospitalizations, in-hospital mortality. We conducted a quantitative assessment with a manual chart review for a sample of 40 patients who had discordance between diagnostic code and laboratory result findings.

Results: There were 69,423 individuals with either a diagnosis code or a laboratory diagnosis of a SARS-CoV-2 infection. Of these, 61,023 individuals had a principal or a secondary diagnosis code for COVID-19 and 50,355 had a positive SARS-CoV-2 PCR or antigen test.
Among those with a positive PCR, 38,506 (76.5%) also had a principal diagnosis and 3,449 (6.8%) a secondary diagnosis of COVID-19, but 8,400 (16.7%) had no COVID-19 diagnosis in the medical record. Moreover, of the 61,023 patients who had a COVID-19 diagnosis, 19,068 (31.2%) did not have a positive laboratory test for SARS-CoV-2 in the EHR. In a manual chart review of a sample of these patients, we found that many had a COVID-19 diagnosis code added during healthcare encounters related to asymptomatic testing, either as part of a screening program or following exposure, but with negative subsequent test results. The positive predictive value (precision) and sensitivity (recall) of a COVID-19 diagnosis in the medical record for a positive SARS-CoV-2 PCR were 68.8% and 83.3%, respectively. Further, among 5,109 patients who were hospitalized with a principal diagnosis of COVID-19, 4,843 (94.8%) had a positive SARS-CoV-2 PCR or antigen test within the 2 weeks preceding hospital admission or during hospitalization. In a random sample of 10 patients without a positive test during the index hospitalization who were selected for manual chart review, 7 (70.0%) had been tested at an outside laboratory before admission and the remaining 3 had a strong clinical suspicion for COVID-19. In addition, 789 hospitalizations had a secondary diagnosis of COVID-19, of which 446 (56.5%) had a principal diagnosis that was consistent with severe clinical manifestation of COVID-19 (e.g., sepsis or respiratory failure). Compared with the cohort that had a principal diagnosis of COVID-19, those with a secondary diagnosis were more frequently male and White and had more than 2-fold higher in-hospital mortality (13.2% vs 28.0%, P<0.001).

Conclusions: In a large integrated health system, COVID-19 diagnosis codes were not adequate for case identification and epidemiological surveillance of SARS-CoV-2 infection.
In contrast, a principal diagnosis of COVID-19 consistently identified hospitalized patients with the disease but missed nearly 10% of cases, which presented with more severe manifestations of disease and had over 2-fold higher mortality. Data from the EHR can provide additional data elements compared to administrative claims alone, such as laboratory testing results, that can be used in conjunction with diagnostic codes to create more fine-tuned phenotypes designed for specific analytical use cases.
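
The positive predictive value and sensitivity reported in the Results can be recomputed from the counts stated there. The following is a minimal sketch in Python, not the study's own analysis code: the variable names and the helper `ppv_and_sensitivity` are hypothetical, and a positive laboratory test is treated as the reference standard, as the abstract does.

```python
# Counts as reported in the Results (variable names are illustrative).
dx_total = 61_023              # patients with a principal or secondary COVID-19 diagnosis code
test_pos_total = 50_355        # patients with a positive SARS-CoV-2 test in the EHR
dx_principal_and_pos = 38_506  # test-positive patients with a principal diagnosis
dx_secondary_and_pos = 3_449   # test-positive patients with a secondary diagnosis


def ppv_and_sensitivity(true_pos, flagged, reference_pos):
    """Return (PPV, sensitivity) for a phenotype against a reference standard.

    true_pos:      patients flagged by the phenotype AND positive by the reference
    flagged:       all patients flagged by the phenotype (here, diagnosis code)
    reference_pos: all patients positive by the reference (here, laboratory test)
    """
    if flagged <= 0 or reference_pos <= 0:
        raise ValueError("denominators must be positive")
    return true_pos / flagged, true_pos / reference_pos


# Patients with both a diagnosis code and a positive test.
true_pos = dx_principal_and_pos + dx_secondary_and_pos

ppv, sensitivity = ppv_and_sensitivity(true_pos, dx_total, test_pos_total)

# Discordant groups described in the Results.
dx_without_pos_test = dx_total - true_pos        # diagnosis code, no positive test
pos_test_without_dx = test_pos_total - true_pos  # positive test, no diagnosis code

print(f"PPV: {ppv:.1%}, sensitivity: {sensitivity:.1%}")
print(f"Code without positive test: {dx_without_pos_test}")
print(f"Positive test without code: {pos_test_without_dx}")
```

The two discordant counts derived this way match the 19,068 and 8,400 patients reported in the Results, which gives a quick internal consistency check on the stated figures.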