18.10 Use Of Deep Learning To Identify Peripheral Arterial Disease Cases From Narrative Clinical Notes

A. A. Gonzalez1,2, A. Zolensky2,4, H. D. Aridi1, P. Zhang3, S. Dev2,3
1Indiana University School Of Medicine, Vascular Surgery, Indianapolis, IN, USA
2Regenstrief Institute, Center For Health Services Research, Indianapolis, IN, USA
3The Ohio State University, Computer Science And Engineering, Columbus, OH, USA
4University Of Pennsylvania, Computer And Information Science, Philadelphia, PA, USA

Introduction:
Peripheral arterial disease (PAD) is the leading cause of amputations in the United States. Although PAD affects 8.5 million Americans and more than 200 million people globally, awareness among both patients and providers remains limited. Ongoing efforts to raise PAD awareness among the public and healthcare professionals have not met with widespread success. Thus, there is a need for alternative methods for identifying PAD patients. One potentially promising strategy leverages natural language processing (NLP) to digitally screen patients for PAD. Prior researchers have used keyword-based searches and billing codes to ascertain which patients have PAD. However, these approaches are inherently limited and may fail to capture patients with undiagnosed PAD. Recent advances in deep learning (DL) applied to NLP allow an algorithm to learn a conceptual representation of PAD without applying a strict rule-based approach (e.g., searching for a particular set of keywords or billing codes). Herein, we investigate the use of DL to identify patients with PAD based on unstructured narrative notes from the electronic health record.

Method:

We first created a dataset of all patients in two large state-wide health systems with diagnostic or procedural codes (ICD-9/10 or CPT) for PAD. We next subdivided the study population into training, testing, and validation (hold-out) cohorts. We designated each inpatient and outpatient encounter as a case (primary billing code for PAD) or a control (primary billing code for a non-PAD diagnosis or procedure).
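As a minimal sketch of the labeling and cohort assignment described above, the following Python assumes a small illustrative set of PAD code prefixes and a patient-level split with hypothetical 80/10/10 proportions. The abstract does not list the actual codes, the split ratios, or whether the split was made per patient, so `PAD_CODE_PREFIXES`, `label_encounter`, and `assign_cohort` are all assumptions for illustration.

```python
import hashlib

# Illustrative ICD-10 / ICD-9 prefixes only (assumption); the abstract does not
# enumerate the diagnostic or procedural codes actually used.
PAD_CODE_PREFIXES = ("I70.2", "I73.9", "440.2", "443.9")


def label_encounter(primary_code: str) -> int:
    """Return 1 (case) if the primary billing code matches a PAD prefix, else 0 (control)."""
    return int(primary_code.strip().upper().startswith(PAD_CODE_PREFIXES))


def assign_cohort(patient_id: str, train: float = 0.8, test: float = 0.1) -> str:
    """Deterministically map a patient to one cohort.

    Hashing the patient ID keeps every encounter of a patient in the same cohort
    (an assumed design choice, not stated in the abstract).
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + test:
        return "test"
    return "validation"


if __name__ == "__main__":
    encounters = [("patient-001", "I70.211"), ("patient-002", "E11.9")]
    for patient_id, code in encounters:
        print(patient_id, label_encounter(code), assign_cohort(patient_id))
```

Splitting by patient rather than by encounter avoids having notes from the same patient in both the training and hold-out cohorts.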

Next, we compared the performance of a keyword search (KS) model with that of a supervised deep learning (DL) model on a binary PAD classification task. The KS model we implemented was based on the currently published state-of-the-art for identifying PAD cases from electronic medical record data. Our DL model used the pre-trained BioMed-RoBERTa-base, a RoBERTa-base model with continued pre-training on 2.68 million full-text scientific papers from the Semantic Scholar corpus (7.55B tokens and 47GB of data). We then evaluated model performance for detecting PAD cases from narrative progress notes.
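To make the comparison concrete, the sketch below shows what a keyword-search baseline and the usual binary classification metrics might look like. The keyword list is hypothetical and is not the lexicon of the published KS model, and the abstract does not name the specific performance measures reported, so accuracy, precision, recall, and F1 are assumed here.

```python
import re

# Hypothetical keyword list for illustration; not the published KS lexicon.
PAD_KEYWORDS = [
    "peripheral arterial disease",
    "peripheral artery disease",
    "peripheral vascular disease",
    "claudication",
]
PAD_PATTERN = re.compile("|".join(re.escape(k) for k in PAD_KEYWORDS), re.IGNORECASE)


def keyword_predict(note: str) -> int:
    """Return 1 if any PAD keyword appears in the note text, else 0."""
    return int(PAD_PATTERN.search(note) is not None)


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    notes = [
        "Patient reports claudication in the left calf.",
        "No acute distress. Follow up in six months.",
        "Diminished pedal pulses with a non-healing toe ulcer.",  # no keyword present
    ]
    y_true = [1, 0, 1]
    y_pred = [keyword_predict(n) for n in notes]
    print(y_pred, binary_metrics(y_true, y_pred))
```

The third note illustrates the limitation motivating the DL approach: a description suggestive of PAD that contains none of the listed keywords is missed by a rule-based search.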

Results:

Our dataset included 339,049 encounters across 62,854 patients represented by 1,588,202 notes. On the task of correctly identifying patients with PAD from a given encounter, the deep learning model outperformed the keyword search-based model on all performance measures (Table).

Conclusion:

Our findings suggest that deep learning outperforms keyword search for identifying PAD cases from clinical narratives. Planned future work will extend this algorithm to stratify patients based on clinical scoring systems.