33.02 Can ChatGPT Help Us Bill Operative Notes More Accurately?

R.W. Browning3, W.M. Oslock1, L. Wood1, R. Chandra3, Y. Wu3, M. Rubyan2, D. Mosher4, N. Kohli5, D.I. Chu1. 1University of Alabama at Birmingham, Department of Surgery, Division of Gastrointestinal Surgery, Birmingham, AL, USA; 2University of Michigan, School of Public Health, Ann Arbor, MI, USA; 3University of Alabama at Birmingham, Heersink School of Medicine, Birmingham, AL, USA; 4University of Alabama at Birmingham, School of Public Health, Birmingham, AL, USA; 5University of Alabama at Birmingham, College of Arts and Sciences, Birmingham, AL, USA

Introduction: The use of artificial intelligence in medicine, including for administrative tasks, is a rapidly expanding area of research. This is especially promising for small hospitals that lack the resources to invest in dedicated billing staff. One application that has not been well studied is the use of large language models to assign billing codes from operative notes. Given this gap, our study aimed to evaluate the feasibility of using an accessible large language model to code colorectal operative notes.

Methods: This retrospective study identified colorectal operations performed in 2022 within our institutional National Surgical Quality Improvement Program (NSQIP) database. Operations were identified by their Current Procedural Terminology (CPT) codes. The operative notes were then de-identified and entered into ChatGPT-3.5 to assign CPT codes using three prompts. The Basic prompt asked: “Please provide the appropriate CPT code for the following operative note.” The Role-based prompt asked: “Take on the role of a medical coder and provide any appropriate CPT codes for the following operative notes.” Lastly, the Example-based prompt offered three worked examples: “Attached are three operative notes with appropriate CPT codes. Please provide the appropriate CPT code for the following fourth operative note.” Each prompt was repeated five times per operative note, and the assigned codes were compared with the NSQIP-recorded CPT code to assess consistency. Both NSQIP-recorded and ChatGPT-assigned CPT codes were then converted to work relative value units (wRVUs) and compared using chi-square tests and ANOVA.
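
The abstract does not specify how notes were submitted to ChatGPT-3.5; the following is a minimal sketch, assuming the prompting workflow were automated through the OpenAI Python SDK. The model identifier, note-loading step, and response handling are illustrative assumptions, and the three worked examples used by the Example-based prompt are omitted.

```python
# Illustrative sketch only: one way to send each de-identified operative note
# with each prompt style, repeated five times to assess consistency.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPTS = {
    "basic": "Please provide the appropriate CPT code for the following operative note.",
    "role": ("Take on the role of a medical coder and provide any appropriate "
             "CPT codes for the following operative notes."),
    "example": ("Attached are three operative notes with appropriate CPT codes. "
                "Please provide the appropriate CPT code for the following fourth operative note."),
}

def query_cpt(prompt_key: str, note_text: str, repeats: int = 5) -> list[str]:
    """Submit one operative note under a given prompt style and return the
    model's raw responses across repeated queries."""
    responses = []
    for _ in range(repeats):
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumed model identifier
            messages=[{"role": "user", "content": f"{PROMPTS[prompt_key]}\n\n{note_text}"}],
        )
        responses.append(completion.choices[0].message.content)
    return responses
```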

Results: In total, 447 colorectal operative notes were collected, with a mean word count of 622. The majority of operations were performed by colorectal surgeons (90.6%). The most common procedures were 44205 (MIS colon resection with ostomy; n=81, 32.9%) and 44207 (MIS colon resection with low pelvic anastomosis; n=81, 32.9%). ChatGPT-assigned CPT codes rarely matched NSQIP-recorded codes, with similarly poor accuracy across prompts: 20.0% for Basic, 19.8% for Role-based, and 20.8% for Example-based (p=0.83). In a multivariable model of code accuracy controlling for specialty, word count, prompt, and procedure, prompt type remained non-significant. wRVU results were more variable. With the Basic and Example-based prompts, ChatGPT-assigned CPT codes yielded wRVUs close to those of the NSQIP-recorded codes, differing by only +1.71 and +0.76 wRVUs, respectively. In contrast, the Role-based prompt resulted in a mean difference of +11.63 wRVUs.
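
A hedged sketch of the reported analyses is shown below; it is not the authors' actual code. The column names (prompt, correct, specialty, word_count, procedure, wrvu_diff) and the results file are hypothetical, and the logistic regression is one plausible form of the multivariable accuracy model described in the abstract.

```python
# Sketch of the chi-square test of accuracy by prompt, a multivariable model
# of accuracy, and a one-way ANOVA of wRVU differences (ChatGPT minus NSQIP).
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

df = pd.read_csv("chatgpt_cpt_results.csv")  # hypothetical per-response results file

# Chi-square: does accuracy (matched the NSQIP code or not) differ by prompt?
contingency = pd.crosstab(df["prompt"], df["correct"])
chi2, p_chi2, dof, _ = stats.chi2_contingency(contingency)

# Multivariable logistic regression of accuracy controlling for specialty,
# word count, prompt, and procedure.
logit = smf.logit(
    "correct ~ C(prompt) + C(specialty) + word_count + C(procedure)", data=df
).fit()

# One-way ANOVA: do wRVU differences vary across the three prompts?
groups = [g["wrvu_diff"].values for _, g in df.groupby("prompt")]
f_stat, p_anova = stats.f_oneway(*groups)

print(f"chi-square p={p_chi2:.2f}; ANOVA p={p_anova:.3f}")
print(logit.summary())
```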

Conclusion: In this study, ChatGPT did not consistently assign CPT codes that matched the NSQIP-recorded codes. However, with the Role-based prompt, ChatGPT identified CPT codes carrying higher wRVUs than the NSQIP-recorded codes. While further research is needed, these findings suggest an opportunity to address potentially lost wRVUs (and therefore revenue) using large language models.