Background: Phenotype information in electronic health records (EHRs) is mainly recorded in unstructured free text, which cannot be directly used for clinical research. EHR-based deep-phenotyping methods can structure phenotype information in EHRs with high fidelity, which has made them a focus of medical informatics. However, developing a deep-phenotyping method for non-English EHRs (eg, Chinese EHRs) is challenging. Although numerous EHR resources exist in China, fine-grained annotation data suitable for developing deep-phenotyping methods are limited, so any such method for Chinese EHRs must be developed in a low-resource scenario.
Objective: In this study, we aimed to develop a deep-phenotyping method with good generalization ability for Chinese EHRs based on limited fine-grained annotation data.
Methods: The core of the methodology was to identify linguistic patterns of phenotype descriptions in Chinese EHRs with a sequence motif discovery tool and then perform deep phenotyping of Chinese EHRs by recognizing those patterns in free text. Specifically, 1000 Chinese EHRs were manually annotated based on a fine-grained information model, PhenoSSU (Semantic Structured Unit of Phenotypes). The annotated data set was randomly divided into a training set (n=700, 70%) and a test set (n=300, 30%). Linguistic patterns were mined in three steps. First, free text in the training set was encoded as single-letter sequences (P: phenotype, A: attribute). Second, a biological sequence analysis tool, MEME (Multiple EM for Motif Elicitation), was used to identify motifs in the single-letter sequences. Finally, the identified motifs were reduced to a series of regular expressions representing linguistic patterns of PhenoSSU instances in Chinese EHRs. Based on the discovered linguistic patterns, we developed a deep-phenotyping method for Chinese EHRs that combines a deep learning–based method for named entity recognition with a pattern recognition–based method for attribute prediction.
Results: In total, 51 statistically significant sequence motifs were mined from the 700 Chinese EHRs in the training set and combined into six regular expressions. These six regular expressions could be learned from a mean of 134 (SD 9.7) annotated EHRs in the training set. The deep-phenotyping algorithm for Chinese EHRs recognized PhenoSSU instances with an overall accuracy of 0.844 on the test set. For the subtask of entity recognition, the algorithm achieved an F1 score of 0.898 with the Bidirectional Encoder Representations from Transformers (BERT)–bidirectional long short-term memory (BiLSTM)–conditional random field (CRF) model; for the subtask of attribute prediction, the algorithm achieved a weighted accuracy of 0.940 with the linguistic pattern–based method.
Conclusions: We developed a simple but effective strategy to perform deep phenotyping of Chinese EHRs with limited fine-grained annotation data. Our work will promote the secondary use of Chinese EHRs and may inspire similar efforts in other non–English-speaking countries.
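The encoding and pattern-recognition steps described in the Methods can be illustrated with a minimal sketch. It assumes that named entity recognition has already tagged each token, that any token that is neither a phenotype nor an attribute is encoded as "O" (a letter not mentioned in the abstract), and that the regular expression `A*PA*` is a stand-in: it is not one of the six published expressions. The function names and the English toy tokens are likewise hypothetical.

```python
import re
from typing import List

# P and A follow the abstract (P: phenotype, A: attribute).
# "O" for every other token is an assumption made for this sketch.
TAG_TO_LETTER = {"phenotype": "P", "attribute": "A"}

# Illustrative pattern only, NOT one of the six published expressions:
# a phenotype optionally preceded and followed by adjacent attributes.
PATTERN = re.compile(r"A*PA*")


def encode(tags: List[str]) -> str:
    """Encode a list of token tags as a single-letter sequence."""
    return "".join(TAG_TO_LETTER.get(tag, "O") for tag in tags)


def find_instances(tokens: List[str], tags: List[str]) -> List[List[str]]:
    """Return token groups whose letter sequence matches PATTERN.

    Each token maps to exactly one letter, so match offsets in the
    letter sequence can be used directly as token indices.
    """
    sequence = encode(tags)
    return [tokens[m.start():m.end()] for m in PATTERN.finditer(sequence)]


if __name__ == "__main__":
    # Toy English tokens standing in for a segmented Chinese EHR sentence.
    tokens = ["no", "obvious", "cough", ",", "fever", "for", "3 days"]
    tags = ["attribute", "attribute", "phenotype", "other",
            "phenotype", "other", "attribute"]
    print(encode(tags))                  # AAPOPOA
    print(find_instances(tokens, tags))  # [['no', 'obvious', 'cough'], ['fever']]
```

In this toy run, the trailing attribute "3 days" is not attached to "fever" because an unrelated token separates them. A single naive pattern misses such cases, which is one reason a set of several mined expressions would be needed in practice.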
BACKGROUND: Phenotype information in electronic health records (EHRs) is mainly recorded in unstructured free text, which cannot be directly used for clinical research. EHR-based deep phenotyping methods can structure phenotype information in EHRs with high fidelity, which has made them a focus of medical informatics. However, developing a deep phenotyping method for non-English EHRs (such as Chinese EHRs) is challenging. Although numerous EHR resources exist in China, fine-grained annotation data suitable for developing deep phenotyping methods are limited, so any such method for Chinese EHRs must be developed in a low-resource scenario.
OBJECTIVE: In this study, we aimed to develop a deep phenotyping method with good generalization ability for Chinese EHRs based on limited fine-grained annotation data.
METHODS: The core of the methodology was to learn linguistic patterns of phenotype descriptions in Chinese EHRs with a sequence motif discovery tool and then perform deep phenotyping of Chinese EHRs by recognizing the learned linguistic patterns in free text. Specifically, 1,000 Chinese EHRs were manually annotated based on a fine-grained information model, PhenoSSU (the Semantic Structured Unit of Phenotypes). The annotated dataset was randomly divided into a training set (70%) and a test set (30%). Linguistic patterns were mined in three steps. First, free text in the training set was encoded as single-letter sequences (P: phenotype, A: attribute). Second, the biological sequence analysis tool MEME was used to identify motifs in the single-letter sequences. Finally, the identified motifs were reduced to a series of regular expressions representing linguistic patterns of PhenoSSU instances in Chinese EHRs. Based on the discovered linguistic patterns, we developed a deep phenotyping method for Chinese EHRs that combines a deep learning–based model for named entity recognition with a pattern recognition–based method for attribute prediction.
RESULTS: Fifty-one statistically significant sequence motifs were mined from 700 Chinese EHRs in the training set and combined into six regular expressions. These six regular expressions could be learned from 134 (+/−9.7) annotated EHRs in the training set. The deep phenotyping algorithm for Chinese EHRs recognized PhenoSSU instances with an overall accuracy of 0.844 on the test set. For the subtask of entity recognition, the algorithm achieved an F1-score of 0.898 with the BERT-BiLSTM-CRF model; for the subtask of attribute prediction, the algorithm achieved a weighted accuracy of 0.940 with the linguistic pattern–based method.
CONCLUSIONS: We developed a simple but effective strategy to perform deep phenotyping of Chinese EHRs with limited fine-grained annotation data. Our work will promote the secondary use of Chinese EHRs and may inspire similar efforts in other non-English-speaking countries.
UNSTRUCTURED: In “Constructing High-Fidelity Phenotype Knowledge Graphs for Infectious Diseases With a Fine-Grained Semantic Information Model: Development and Usability Study” (J Med Internet Res 2021;23(6):e26892), the authors noted one error. The affiliation “Suzhou Institute of Systems Medicine” was incorrect and should read “Center of Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College; Suzhou Institute of Systems Medicine.”
BACKGROUND: Phenotypes characterize the clinical manifestations of diseases and provide important information for diagnosis. Constructing phenotype knowledge graphs of diseases is therefore valuable to the development of artificial intelligence in medicine. However, the phenotype knowledge graphs in current knowledge bases such as WikiData and DBpedia are coarse-grained, because they consider only the core concepts of phenotypes and neglect the details (attributes) associated with them.
OBJECTIVE: To characterize the details of disease phenotypes in clinical guidelines, we proposed a fine-grained semantic information model named PhenoSSU (Semantic Structured Unit of Phenotypes).
METHODS: PhenoSSU is by nature an "entity-attribute-value" model that aims to capture the full semantics underlying phenotype descriptions with a series of attributes and values. A total of 193 clinical guidelines of infectious diseases from Wikipedia were selected as the study corpus, and 12 attributes from SNOMED-CT were introduced into the PhenoSSU model based on co-occurrences of phenotype concepts and attribute values. The expressive power of the PhenoSSU model was evaluated by analyzing whether a PhenoSSU instance could capture the full semantics underlying the corresponding phenotype description. To automatically construct fine-grained phenotype knowledge graphs, we developed a hybrid strategy that first recognized phenotype concepts with the MetaMap tool and then predicted the attribute values of phenotypes with machine learning classifiers.
RESULTS: Fine-grained phenotype knowledge graphs of 193 infectious diseases were manually constructed with the BRAT annotation tool. The PhenoSSU model could precisely represent 89.5% (3757/4020) of phenotype descriptions in clinical guidelines. By comparison, other information models such as the Clinical Element Model and the HL7 FHIR model could capture the full semantics underlying only 48.4% and 21.8% of phenotype descriptions, respectively. The hybrid strategy achieved an F1-score of 0.732 for the subtask of phenotype concept recognition and an average weighted accuracy of 0.776 for the subtask of attribute value prediction.
CONCLUSIONS: PhenoSSU is an effective information model for the precise representation of phenotype knowledge in clinical guidelines, and machine learning can be used to improve the efficiency of constructing PhenoSSU-based knowledge graphs. Our work will potentially benefit knowledge-based systems for diagnosis.
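The "entity-attribute-value" structure described above can be pictured with a small data-structure sketch. This is only an illustration: the class layout, the method name `as_triples`, and the example attribute names ("severity", "temporal pattern") are assumptions, since the abstract does not list the 12 attributes taken from SNOMED-CT.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PhenoSSU:
    """One phenotype concept (entity) plus its attribute-value pairs."""

    concept: str
    # Attribute name -> value. The names used in the example below are
    # hypothetical, not the model's actual attribute set.
    attributes: Dict[str, str] = field(default_factory=dict)

    def as_triples(self) -> List[Tuple[str, str, str]]:
        """Flatten the unit into (entity, attribute, value) triples."""
        return [(self.concept, name, value)
                for name, value in self.attributes.items()]


if __name__ == "__main__":
    unit = PhenoSSU("cough", {"severity": "severe",
                              "temporal pattern": "persistent"})
    for triple in unit.as_triples():
        print(triple)
```

Flattening to triples is one convenient way to load such units into a knowledge graph, with the phenotype concept as the node and each attribute as a labeled edge to its value.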