Objective Natural language processing (NLP) systems convert unstructured text into analyzable data. Here, we describe the performance measures of NLP to capture granular details on nodules from thyroid ultrasound (US) reports and reveal critical issues with reporting language.
Methods We iteratively developed NLP tools using clinical Text Analysis and Knowledge Extraction System (cTAKES) and thyroid US reports from 2007 to 2013. We incorporated nine nodule features for NLP extraction. Next, we evaluated the precision, recall, and accuracy of our NLP tools using a separate set of US reports from an academic medical center (A) and a regional health care system (B) during the same period. Two physicians manually annotated each test-set report. A third physician then adjudicated discrepancies. The adjudicated “gold standard” was then used to evaluate NLP performance on the test-set.
Results A total of 243 thyroid US reports contained 6,405 data elements. Inter-annotator agreement for all elements was 91.3%. Compared with the gold standard, overall recall of the NLP tool was 90%. NLP recall for thyroid lobe or isthmus characteristics was: laterality 96% and size 95%. NLP accuracy for nodule characteristics was: laterality 92%, size 92%, calcifications 76%, vascularity 65%, echogenicity 62%, contents 76%, and borders 40%. NLP recall for presence or absence of lymphadenopathy was 61%. Reporting style accounted for 18% errors. For example, the word “heterogeneous” interchangeably referred to nodule contents or echogenicity. While nodule dimensions and laterality were often described, US reports only described contents, echogenicity, vascularity, calcifications, borders, and lymphadenopathy, 46, 41, 17, 15, 9, and 41% of the time, respectively. Most nodule characteristics were equally likely to be described at hospital A compared with hospital B.
Conclusions NLP can automate extraction of critical information from thyroid US reports. However, ambiguous and incomplete reporting language hinders performance of NLP systems regardless of institutional setting. Standardized or synoptic thyroid US reports could improve NLP performance.