Automated and accurate identification of refugees in healthcare databases is a critical first step toward investigating the healthcare needs of this vulnerable population and reducing health disparities. In this study, we developed a machine-learning method, named the refugee identification system (RIS), to address this need. We curated a data set consisting of 103 refugees and 930 non-refugees in Arizona. We compiled de-identified individual-level information (age, primary language, and noise-masked home address), along with state-level refugee resettlement statistics and world language statistics. We then performed feature engineering to convert language and masked address into quantitative features. Finally, we built a random forest model to classify refugees and non-refugees. RIS achieved high classification accuracy (overall accuracy = 0.97, specificity = 0.99, sensitivity = 0.85, positive predictive value = 0.88, negative predictive value = 0.98, and area under the receiver operating characteristic curve = 0.98). RIS is customizable for refugee identification outside Arizona. Its application enables large-scale investigation of refugee healthcare needs and reduction of health disparities.
Objective: Automated and accurate identification of refugees in healthcare databases is a critical first step toward investigating the healthcare needs of this vulnerable population and reducing health disparities. This study developed a machine-learning method, named the refugee identification system (RIS), that uses features commonly collected in healthcare databases to classify refugees and non-refugees. Materials and Methods: We compiled a curated data set consisting of 103 refugees and 930 non-refugees in Arizona. For each person in the curated data set, we collected age, primary language, and home address. We supplemented individual-level data with state-level refugee resettlement statistics and world language statistics, then performed feature engineering to convert primary language and home address into quantitative features. Finally, we built a random forest model to classify refugee status. Results: Evaluated on holdout testing data, RIS achieved a high classification accuracy of 0.97, specificity of 0.98, sensitivity of 0.88, positive predictive value of 0.83, and negative predictive value of 0.99. The area under the receiver operating characteristic curve was 0.96. Discussion and Conclusion: RIS is an automated, accurate, generalizable, and scalable method that can be used to identify refugees in healthcare databases. It enables large-scale investigation of refugee healthcare needs and reduction of health disparities.
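The reported metrics are linked by standard identities: given sensitivity, specificity, and the share of refugees in the evaluated sample, the predictive values and overall accuracy follow directly. The sketch below is illustrative only and is not the authors' code. The function name `predictive_values` is hypothetical, and it assumes the holdout set has the same class balance as the full curated data set (103 refugees among 1,033 people), which the abstract does not state.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Derive PPV, NPV, and accuracy from sensitivity, specificity, prevalence.

    All arguments are fractions in [0, 1]. Uses the standard Bayes-rule
    identities for a binary classifier.
    """
    # Expected fraction of the sample falling in each confusion-matrix cell
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)

    ppv = true_pos / (true_pos + false_pos)   # positive predictive value
    npv = true_neg / (true_neg + false_neg)   # negative predictive value
    accuracy = true_pos + true_neg            # the four cells sum to 1
    return ppv, npv, accuracy


# ASSUMPTION: holdout class balance equals that of the full curated data set.
prevalence = 103 / (103 + 930)

# Sensitivity and specificity as reported for the holdout evaluation.
ppv, npv, accuracy = predictive_values(
    sensitivity=0.88, specificity=0.98, prevalence=prevalence
)
print(f"PPV={ppv:.2f} NPV={npv:.2f} accuracy={accuracy:.2f}")
```

Under that class-balance assumption, the derived values round to the reported positive predictive value (0.83), negative predictive value (0.99), and accuracy (0.97). The same identities show why positive predictive value is the metric most sensitive to prevalence: in a database where refugees are much rarer than roughly 10% of records, the same sensitivity and specificity would yield a lower positive predictive value.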