Background and Aims: Chronic hepatitis B (CHB) affects >290 million persons globally, and only 10% have been diagnosed, presenting a severe gap that must be addressed. We developed logistic regression (LR) and machine learning (ML; random forest) models to accurately identify patients with HBV, using only easily obtained demographic data from a population-based data set.
Approach and Results: We identified participants with data on HBsAg, birth year, sex, race/ethnicity, and birthplace from 10 cycles of the National Health and Nutrition Examination Survey (1999–2018) and divided them into two cohorts: training (cycles 2, 3, 5, 6, 8, and 10; n = 39,119) and validation (cycles 1, 4, 7, and 9; n = 21,569). We then developed and tested our two models. The overall cohort was 49.2% male, 39.7% White, 23.2% Black, 29.6% Hispanic, and 7.5% Asian/other, with a median birth year of 1973. In multivariable logistic regression, the following factors were associated with HBV infection: birth year 1991 or after (adjusted OR [aOR], 0.28; p < 0.001); male sex (aOR, 1.49; p = 0.0080); Black and Asian/other versus White (aOR, 5.23 and 9.13; p < 0.001 for both); and being USA-born (vs. foreign-born; aOR, 0.14; p < 0.001). We found that the ML model consistently outperformed the LR model, with higher area under the receiver operating characteristic curve values (0.83 vs. 0.75 in validation cohort; p < 0.001) and better differentiation of high- and low-risk persons.
Conclusions: Our ML model provides a simple, targeted approach to HBV screening, using only easily obtained demographic data.
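
As an illustration of the evaluation design described above, the following minimal sketch shows how a cycle-based cohort split and an area under the receiver operating characteristic curve could be computed. It is not the authors' code. The record layout (a dictionary with a `cycle` key) and the function names `split_by_cycle` and `auroc` are hypothetical, and the pairwise AUROC calculation is chosen for readability rather than speed.

```python
from typing import Dict, List, Sequence, Tuple

# Cycle assignments as stated in the abstract.
TRAIN_CYCLES = {2, 3, 5, 6, 8, 10}
VALID_CYCLES = {1, 4, 7, 9}


def split_by_cycle(records: Sequence[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split records into training and validation cohorts by survey cycle.

    Assumes each record has an integer 'cycle' field (hypothetical layout).
    """
    train = [r for r in records if r["cycle"] in TRAIN_CYCLES]
    valid = [r for r in records if r["cycle"] in VALID_CYCLES]
    return train, valid


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Under these assumptions, each model's predicted risk for the validation cohort would be passed to `auroc` together with the observed HBsAg status (1 = positive, 0 = negative), and the two resulting values compared.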