05.04 The Use of Large Language Models for the De Novo Generation of Colorectal Education Materials

B.A. Brock1, W. Oslock1, A.A. Harsono1, S.C. Hodges1, L. Wood1, R.H. Hollis1, A. Abbas1, G. Hernandez-Marquez1, M.A. Rubyan2, D.I. Chu1  1University Of Alabama at Birmingham, Department Of Surgery, Birmingham, Alabama, USA 2University Of Michigan, School Of Public Health, Ann Arbor, MI, USA

Introduction: Low health literacy is associated with worse surgical outcomes. Improving the readability of patient education materials offers one strategy to address low health literacy. Large language models (LLMs) may provide a scalable way to improve education materials but have not been thoroughly tested. Therefore, the aim of this study was to use LLMs to generate de novo education materials and compare their readability to that of existing materials.


Methods: This study utilized existing patient education materials for preoperative, postoperative, and ostomy care from an academic medical center and compared them to de novo materials generated by the following LLMs: ChatGPT3.5, Copilot, and Gemini. The following previously tested metric-based prompt was used: “Please give me patient education information about … risks, expectations, and preparation that is health literate and at a sixth-grade reading level using short syllables and words with <3 syllables.” This prompt was entered into each LLM for each topic of existing education material. Original and new materials were then assessed for readability using the Flesch Reading Ease (Ease) and Flesch-Kincaid Grade Level (Grade Level). Ease is scored on a scale of 0-100, with higher scores indicating easier readability, while Grade Level approximates the U.S. school grade level of the text, with lower grades indicating easier readability. Scores for the generated materials were then compared to baseline scores using paired t-tests.
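As an illustration of this scoring workflow, the minimal sketch below shows how paired readability comparisons of this kind could be computed in Python, assuming the textstat and scipy packages are available; the example texts and variable names are hypothetical placeholders, not the study's actual materials or code.

```python
# Sketch: score matched original vs. LLM-generated materials for readability
# and compare them with paired t-tests. Assumes the `textstat` and `scipy`
# packages; the example texts below are hypothetical placeholders.
import textstat
from scipy import stats

original_texts = [
    "Prior to your operation, refrain from consuming food or beverages after midnight.",
    "Following discharge, monitor the incision site daily for erythema or purulent drainage.",
]
generated_texts = [  # LLM-generated counterparts, in the same order
    "Do not eat or drink after midnight before your surgery.",
    "Check your cut each day for redness or fluid after you go home.",
]

# Flesch Reading Ease: 0-100 scale, higher scores = easier to read
ease_original = [textstat.flesch_reading_ease(t) for t in original_texts]
ease_generated = [textstat.flesch_reading_ease(t) for t in generated_texts]

# Flesch-Kincaid Grade Level: approximate U.S. school grade, lower = easier
grade_original = [textstat.flesch_kincaid_grade(t) for t in original_texts]
grade_generated = [textstat.flesch_kincaid_grade(t) for t in generated_texts]

# Paired t-tests comparing generated materials to their matched baselines
ease_t, ease_p = stats.ttest_rel(ease_generated, ease_original)
grade_t, grade_p = stats.ttest_rel(grade_generated, grade_original)
print(f"Ease:  t = {ease_t:.2f}, p = {ease_p:.4f}")
print(f"Grade: t = {grade_t:.2f}, p = {grade_p:.4f}")
```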


Results: A total of 52 education materials were identified from the preoperative, postoperative, and ostomy care pathways, with 52 new materials generated by each LLM (ChatGPT3.5, Copilot, and Gemini), for a total of n=156. Across all categories, Gemini-generated materials demonstrated improved reading Ease, with a mean of 71.5 vs 62.1 at baseline (p<0.001). In contrast, both ChatGPT3.5- and Copilot-generated materials demonstrated worsened Ease, with means of 37.0 and 54.4, respectively (both p<0.001). For Grade Level (Figure), Gemini-generated materials also scored better for readability, with an average Grade Level of 5.9 vs 8.0 at baseline (p<0.001). While Copilot's materials were not statistically different, ChatGPT materials were again worse, with an average Grade Level of 12.5 (p<0.001).


Conclusion: We found significant variability in the readability of education materials generated by different LLMs, with Gemini-generated materials showing improved reading Ease and reduced Grade Level, while ChatGPT worsened both measures. These findings highlight the potential benefits and limitations of different LLMs for improving the readability of patient education materials.