79.06 Evaluating ChatGPT Versus Surgeon-Generated Informed Consents for Common Operations

H. C. Decker1, K. Trang1, J. Ramirez1, A. Colley1, L. Pierce2, M. Coleman1, T. Bongiovanni1, G. B. Melton3, E. Wick1  1University Of California – San Francisco, Department Of Surgery, San Francisco, CA, USA 2University Of California – San Francisco, Department Of Medicine, San Francisco, CA, USA 3University of Minnesota, Department Of Surgery, Institute For Health Informatics, And Center For Learning Health System Sciences, Minneapolis, MN, USA

Introduction:
While informed consent is a critical component of patient care, the consent process often fails to give patients a thorough understanding of the risks, benefits, and alternatives of a surgical procedure. Electronic consent forms have the potential to improve patient comprehension, but the information they contain must be accessible to all levels of health literacy, accurate, and complete. Whether large language models, such as ChatGPT, can enhance informed consent documents is not known.

Methods:
We conducted a cross-sectional study comparing randomly selected surgeon-generated risks, benefits, and alternatives (RBAs) used in signed electronic consent forms from a single institution with ChatGPT-generated RBAs for six operations (laparoscopic cholecystectomy, inguinal hernia repair, colectomy, coronary artery bypass graft, knee arthroplasty, and spinal fusion). We provided ChatGPT-3.5 with the following prompt: “Explain the risks, benefits, and alternatives of [procedure name] to a patient at a 6th grade reading level.” A multidisciplinary group of surgeons evaluated responses for readability, accuracy, and completeness. For readability, we used the previously validated Flesch-Kincaid Grade Level, Gunning Fog Index, Simple Measure of Gobbledygook (SMOG) score, and Coleman-Liau Index. For accuracy and completeness, we used a scoring system based on recommendations from Leapfrog, the Joint Commission, and the American College of Surgeons. Scores for surgeon-generated and ChatGPT-generated RBAs were compared using Wilcoxon rank-sum tests.
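The abstract does not specify the software used to compute the readability indices or run the statistical comparison. The sketch below is a minimal, hypothetical Python illustration of that analysis step, assuming the `textstat` package for the four readability measures and `scipy.stats.ranksums` for the Wilcoxon rank-sum test; the variable names and placeholder texts are assumptions, not the authors' actual pipeline or data.

```python
# Illustrative sketch only: the abstract does not name the tools used.
# Assumes the `textstat` and `scipy` packages; texts below are placeholders.
import textstat
from scipy.stats import ranksums

def readability_profile(text: str) -> dict:
    """Compute the four readability indices named in the Methods."""
    return {
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "gunning_fog": textstat.gunning_fog(text),
        "smog": textstat.smog_index(text),
        "coleman_liau": textstat.coleman_liau_index(text),
    }

# Hypothetical RBA texts (surgeon-generated vs. ChatGPT-generated), not real consent language.
surgeon_rbas = ["Sample surgeon consent text describing risks, benefits, and alternatives."]
chatgpt_rbas = ["Sample ChatGPT consent text describing risks, benefits, and alternatives."]

surgeon_grades = [readability_profile(t)["flesch_kincaid_grade"] for t in surgeon_rbas]
chatgpt_grades = [readability_profile(t)["flesch_kincaid_grade"] for t in chatgpt_rbas]

# Wilcoxon rank-sum test comparing the two groups, as described in the Methods.
stat, p_value = ranksums(surgeon_grades, chatgpt_grades)
print(f"Wilcoxon rank-sum statistic={stat:.3f}, p={p_value:.3f}")
```

In practice, the same comparison would be repeated for each readability index and for the completeness and accuracy scores.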

Results:
Of the 36 RBAs evaluated (one per operation generated by ChatGPT and five per operation generated by surgeons), ChatGPT-generated RBAs trended toward lower reading complexity than surgeon-generated RBAs for every operation, but the difference was not statistically significant (p=0.098). The mean composite completeness and accuracy score was lower for surgeon-generated RBAs than for ChatGPT-generated RBAs (1.6, SD=0.5 vs 2.2, SD=0.4; p<0.001). ChatGPT scores were higher than surgeon scores for descriptions of the benefits of surgery (2.3, SD=0.7 vs 1.4, SD=0.7; p<0.001) and alternatives to surgery (2.7, SD=0.5 vs 1.4, SD=0.7; p<0.001). No difference was observed between ChatGPT-generated and surgeon-generated scores for descriptions of the risks of surgery (1.7, SD=0.5 vs 1.7, SD=0.4; p=0.3827).

Conclusion:
ChatGPT has the potential to enhance informed consent documentation. If a large language model were embedded in the electronic health record in a HIPAA-compliant manner, it could be used to provide personalized risk language based on disease severity and underlying conditions.