1557 Background: Patients with prostate cancer are diagnosed through a prostate needle biopsy (PNB). Information contained in PNB pathology reports is critical for informing clinical risk stratification and treatment; however, patient comprehension of PNB pathology reports is low, and formats vary widely by institution. Natural language processing (NLP) models trained to automatically extract key information from unstructured PNB pathology reports could be used to generate personalized educational materials for patients in a scalable fashion and to expedite collecting registry data or screening patients for clinical trials. As proof of concept, we trained and tested four NLP models for accuracy of information extraction. Methods: Using 403 positive PNB pathology reports from over 80 institutions, we converted Portable Document Format (PDF) files into text using the Tesseract optical character recognition (OCR) engine, removed protected health information using the Philter open-source tool, cleaned the text with rule-based methods, and annotated clinically relevant attributes as well as structural attributes relevant to information extraction using the Brat Rapid Annotation Tool. Text pre-processing for classification and extraction was done using scispaCy and rule-based methods. Using a 75:25 train:test split (N = 302 and 101, respectively), we tested conditional random field (CRF), support vector machine (SVM), bidirectional long short-term memory network (Bi-LSTM), and Bi-LSTM-CRF models, reserving 46 training reports as a validation subset for the latter two models. Model-extracted variables were compared for clinical accuracy with values manually obtained from the unprocessed PDF reports. Results: Clinical accuracy of model-extracted variables is reported in the Table. CRF was the highest-performing model, with accuracies of 97% for Gleason grade, 82% for percentage of positive cores (<50% vs. ≥50%), 90% for perineural or lymphovascular invasion, and 100% for presence of non-acinar carcinoma histology. On manual review of inaccurate results, model performance was limited by PDF image quality, errors in OCR processing of tables or columns, and practice variability in reporting the number of biopsy cores. Conclusions: Our results demonstrate successful proof of concept for the use of NLP models in accurately extracting information from PNB pathology reports, though further optimization is needed before use in clinical practice. [Table: see text]
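
The split sizes and the accuracy comparison described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, and it assumes that clinical accuracy means the fraction of reports where the model-extracted value exactly matches the manually obtained value, and that a report with exactly half of its cores positive falls in the ≥50% category. The abstract does not state how the split was drawn, so only the counts are reproduced.

```python
from typing import Sequence


def split_counts(n_reports: int, train_fraction: float) -> tuple[int, int]:
    """Return (train, test) report counts; the train count is rounded down."""
    n_train = int(n_reports * train_fraction)
    return n_train, n_reports - n_train


def dichotomize_positive_cores(positive: int, total: int) -> str:
    """Map a positive-core count to the two categories used in the abstract."""
    if total <= 0 or not 0 <= positive <= total:
        raise ValueError("need 0 <= positive <= total and total > 0")
    # Integer comparison avoids floating-point issues at exactly 50%.
    return "≥50%" if 2 * positive >= total else "<50%"


def clinical_accuracy(extracted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of reports where the extracted value equals the manual value.

    Assumption: exact-match agreement, one value per report.
    """
    if len(extracted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(e == r for e, r in zip(extracted, reference))
    return matches / len(reference)


# Counts from the abstract: 403 reports, 75:25 split, 46 validation reports.
n_train, n_test = split_counts(403, 0.75)
n_fit = n_train - 46  # reports left for fitting the Bi-LSTM models

# Toy values (invented for illustration, not study data).
model_values = [dichotomize_positive_cores(p, t) for p, t in [(2, 12), (6, 12), (9, 12)]]
manual_values = ["<50%", "≥50%", "<50%"]
toy_accuracy = clinical_accuracy(model_values, manual_values)
```

With the numbers in the abstract, rounding 403 × 0.75 down gives the reported 302 training and 101 test reports, which leaves 256 reports for fitting once the 46 validation reports are set aside.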