BackgroundSince medical research based on big data has become more common, the community’s interest and effort to analyze a large amount of semistructured or unstructured text data, such as examination reports, have rapidly increased. However, these large-scale text data are often not readily applicable to analysis owing to typographical errors, inconsistencies, or data entry problems. Therefore, an efficient data cleaning process is required to ensure the veracity of such data.ObjectiveIn this paper, we proposed an efficient data cleaning process for large-scale medical text data, which employs text clustering methods and value-converting technique, and evaluated its performance with medical examination text data.MethodsThe proposed data cleaning process consists of text clustering and value-merging. In the text clustering step, we suggested the use of key collision and nearest neighbor methods in a complementary manner. Words (called values) in the same cluster would be expected as a correct value and its wrong representations. In the value-converting step, wrong values for each identified cluster would be converted into their correct value. We applied these data cleaning process to 574,266 stool examination reports produced for parasite analysis at Samsung Medical Center from 1995 to 2015. The performance of the proposed process was examined and compared with data cleaning processes based on a single clustering method. We used OpenRefine 2.7, an open source application that provides various text clustering methods and an efficient user interface for value-converting with common-value suggestion.ResultsA total of 1,167,104 words in stool examination reports were surveyed. In the data cleaning process, we discovered 30 correct words and 45 patterns of typographical errors and duplicates. We observed high correction rates for words with typographical errors (98.61%) and typographical error patterns (97.78%). The resulting data accuracy was nearly 100% based on the number of total words.ConclusionsOur data cleaning process based on the combinatorial use of key collision and nearest neighbor methods provides an efficient cleaning of large-scale text data and hence improves data accuracy.
Background Chronic disease management is a major health issue worldwide. With the paradigm shift to preventive medicine, disease prediction modeling using machine learning is gaining importance for precise and accurate medical judgement. Objective This study aimed to develop high-performance prediction models for 4 chronic diseases using the common data model (CDM) and machine learning and to confirm the possibility for the extension of the proposed models. Methods In this study, 4 major chronic diseases—namely, diabetes, hypertension, hyperlipidemia, and cardiovascular disease—were selected, and a model for predicting their occurrence within 10 years was developed. For model development, the Atlas analysis tool was used to define the chronic disease to be predicted, and data were extracted from the CDM according to the defined conditions. A model for predicting each disease was built with 4 algorithms verified in previous studies, and the performance was compared after applying a grid search. Results For the prediction of each disease, we applied 4 algorithms (logistic regression, gradient boosting, random forest, and extreme gradient boosting), and all models show greater than 80% accuracy. As compared to the optimized model’s performance, extreme gradient boosting presented the highest predictive performance for the 4 diseases (diabetes, hypertension, hyperlipidemia, and cardiovascular disease) with 80% or greater and from 0.84 to 0.93 in area under the curve standards. Conclusions This study demonstrates the possibility for the preemptive management of chronic diseases by predicting the occurrence of chronic diseases using the CDM and machine learning. With these models, the risk of developing major chronic diseases within 10 years can be demonstrated by identifying health risk factors using our chronic disease prediction machine learning model developed with the real-world data–based CDM and National Health Insurance Corporation examination data that individuals can easily obtain.
Background Pill image recognition systems are difficult to develop due to differences in pill color, which are influenced by external factors such as the illumination from and the presence of a flash. Objective In this study, the differences in color between reference images and real-world images were measured to determine the accuracy of a pill recognition system under 12 real-world conditions (ie, different background colors, the presence and absence of a flash, and different exposure values [EVs]). Methods We analyzed 19 medications with different features (ie, different colors, shapes, and dosages). The average color difference was calculated based on the color distance between a reference image and a real-world image. Results For images with black backgrounds, as the EV decreased, the top-1 and top-5 accuracies increased independently of the presence of a flash. The top-5 accuracy for images with black backgrounds increased from 26.8% to 72.6% when the flash was on and increased from 29.5% to 76.8% when the flash was off as the EV decreased. However, the top-5 accuracy increased from 62.1% to 78.4% for images with white backgrounds when the flash was on. The best top-1 accuracy was 51.1% (white background; flash on; EV of +2.0). The best top-5 accuracy was 78.4% (white background; flash on; EV of 0). Conclusions The accuracy generally increased as the color difference decreased, except for images with black backgrounds and an EV of −2.0. This study revealed that background colors, the presence of a flash, and EVs in real-world conditions are important factors that affect the performance of a pill recognition model.
BACKGROUND Chronic disease management is a major health issue worldwide. OBJECTIVE This study suggests the possibility of preemptive management of chronic diseases by predicting the occurrence of chronic diseases using CDM and machine learning. In this study, four major chronic diseases, namely, diabetes, hypertension, hyperlipidemia, and cardiovascular disease, were selected and a model for predicting their occurrence within 10 years was developed. METHODS We used 4 algorithms to predict disease occurrence. RESULTS XGBoost presented the highest predictive performance for the 4 diseases (diabetes, hypertension, hyperlipidemia, cardiovascular disease) of 80% or more —0.84 to 0.93 in AUC standards—showing the best performance. CONCLUSIONS Through the chronic disease prediction machine learning model developed in this study using RWD-based CDM, even with the National Health Insurance Corporation examination data that can be easily obtained by individuals, the risk of major chronic diseases within 10 years Demonstrate that you can specifically identify your health risk factors.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.