85.30 Artificial Identities: Analyzing Bias in DALLE3 and ImageFX Generated Images of Surgeons

R. Gorijavolu1, S. Boparai5, E. O’Connell1, V. Mathur2, A.M. Stey4, L.A. Celi2  1Johns Hopkins University School Of Medicine, School Of Medicine, Baltimore, MD, USA 2Massachusetts Institute Of Technology, The Laboratory For Computational Physiology, Cambridge, MA, USA 4Feinberg School Of Medicine – Northwestern University, Department Of Surgery, Chicago, IL, USA 5Hackensack Meridian Health, School Of Medicine, Nutley, NJ, USA

Introduction:
Artificial intelligence (AI) tools are increasingly used to generate images for various applications, including medical visualizations. However, biases in AI-generated imagery can perpetuate stereotypes and misrepresentations. This study aims to compare the representation of surgeons in images generated by two diffusion models, DALLE3 and ImageFX (Imagen 2), with a focus on gender, ethnicity, and other visual attributes.

Methods:
We generated 100 images of surgeons with each of the two diffusion models using the single-word prompt "surgeon": DALLE3 images via the OpenAI API and ImageFX images via Google's AI Test Kitchen. Each generated image was coded for gender, ethnicity, hair color, eye color, presence of image deformities, use of glasses, operating context, and other visual attributes, and the images were categorized and quantified based on these variables. Statistical analyses, including Pearson's chi-squared test, Fisher's exact test, and the Wilcoxon rank sum test, were performed to identify significant differences between the two sets of images and thereby quantify and compare the biases present in each model.
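The statistical comparisons described above can be sketched as follows. This is a minimal illustration, not the study's actual analysis code: the gender and glasses contingency tables are reconstructed from the percentages reported in the Results (83/100 vs. 4/100 female; 25/100 vs. 73/100 with glasses), while the bystander-count arrays are hypothetical placeholders for the per-image counts, which are not published in the abstract.

```python
# Sketch of the three tests named in Methods, using scipy.stats.
# Gender and glasses counts are reconstructed from reported percentages;
# the bystander arrays below are illustrative, NOT the study's data.
from scipy.stats import chi2_contingency, fisher_exact, ranksums

# Pearson's chi-squared test on a 2x2 contingency table.
# Rows: model (DALLE3, ImageFX); columns: (female, male).
gender = [[83, 17], [4, 96]]
chi2, p_gender, dof, expected = chi2_contingency(gender)

# Fisher's exact test, often preferred when expected cell counts are small.
# Rows: model (DALLE3, ImageFX); columns: (glasses, no glasses).
glasses = [[25, 75], [73, 27]]
odds_ratio, p_glasses = fisher_exact(glasses)

# Wilcoxon rank sum test for a count variable such as the number of
# individuals depicted around the surgeon (hypothetical example arrays).
dalle3_bystanders = [2, 1, 1, 3, 0, 2, 1]
imagefx_bystanders = [0, 0, 1, 0, 1, 0, 0]
stat, p_bystanders = ranksums(dalle3_bystanders, imagefx_bystanders)

print(p_gender < 0.001, p_glasses < 0.001)
```

With 100 images per model, both reconstructed tables yield p < 0.001, consistent with the significance levels reported in the Results.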

Results:
The analysis revealed significant disparities between the images generated by DALLE3 and ImageFX. DALLE3 predominantly generated female surgeons (83%), while ImageFX produced mostly male surgeons (96%), a significant difference in gender representation between the models (p < 0.001). Regarding ethnicity, DALLE3 depicted a higher percentage of surgeons as people of color (44%) than ImageFX (2%), a substantial disparity in ethnic representation (p < 0.001). Hair color also differed, with 70% of DALLE3 surgeons shown with black hair compared to 18% for ImageFX, as did eye color, with 80% brown eyes for DALLE3 versus 68% for ImageFX (p = 0.040). Further, DALLE3 images had a higher incidence of image deformities (51%) than ImageFX images (26%) (p < 0.001), while ImageFX depicted more surgeons wearing glasses (73% vs. 25% for DALLE3; p < 0.001). Regarding operating context, DALLE3 images more frequently showed surgeons actively operating (68% vs. 17%; p < 0.001). The number of individuals around the surgeon also differed, averaging 1.32 in DALLE3 images versus 0.41 in ImageFX images (p < 0.001).

Conclusion:
This study reveals significant biases and variation in how DALLE3 and ImageFX depict surgeons. The disparities in gender, ethnicity, and other visual attributes underscore the need to develop AI models that ensure diverse and accurate representations in generated visualizations. Addressing these biases is crucial for promoting inclusivity and fairness, and aligning AI models with societal values, goals, and standards is essential for fostering trust and equity in AI applications.