Background: High-yield HIV testing strategies are needed to reach epidemic control. We aimed to predict the HIV status of individuals from their socio-behavioural characteristics.
Methods: We analysed over 3,200 variables from the most recent Demographic and Health Survey of 10 countries in East and Southern Africa. We trained four machine-learning algorithms and selected the best one based on the F1 score. Training and validation were done on 80% of the data. The model was tested on the remaining 20% and on a held-out country, with each country held out in turn. The best algorithm was then retrained on the most predictive variables. We studied two scenarios: one aiming to identify 95% of people living with HIV (PLHIV) and one aiming to identify individuals with a 95% or higher probability of being HIV positive.
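The evaluation design described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `f1_score` and `leave_one_country_out` are hypothetical, the F1 score is written out from its standard definition (the harmonic mean of precision and recall), and the rotation assumes each record carries a country label.

```python
from collections import defaultdict


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for binary labels (1 = HIV positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def leave_one_country_out(countries):
    """Yield (held_out, train_idx, test_idx), holding out each country in turn.

    `countries` is a list with one country label per record (an assumed layout).
    """
    by_country = defaultdict(list)
    for i, country in enumerate(countries):
        by_country[country].append(i)
    for held_out in sorted(by_country):
        test_idx = by_country[held_out]
        train_idx = sorted(
            i for c, idx in by_country.items() if c != held_out for i in idx
        )
        yield held_out, train_idx, test_idx
```

In such a setup, a model would be fitted on `train_idx` and scored with `f1_score` on `test_idx` for every held-out country, which gives one out-of-country estimate per rotation.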
Findings: Overall, 55,151 males and 69,626 females were included. XGBoost performed best in predicting HIV status, with a mean F1 score of 76.8% [95% confidence interval 76.0%-77.6%] for males and 78.8% [78.2%-79.4%] for females. Among the ten most predictive variables, nine were identical for both sexes: longitude, latitude, and altitude of place of residence; current age; age of most recent partner; total lifetime number of sexual partners; years lived in current place of residence; condom use during last intercourse; and wealth index. Model performance decreased only minimally when restricted to these variables. In the first scenario, 7 males and 5 females would need to be tested to identify one HIV-positive person. In the second scenario, 4.2% of males and 6.2% of females would have been identified as a high-risk population.
Interpretation: We were able to identify PLHIV and those at high risk of infection, who may be offered pre-exposure prophylaxis and/or voluntary medical male circumcision. These findings can inform the implementation of HIV prevention and testing strategies.
Funding: Swiss National Science Foundation.