Auditing large language models for race & gender disparities: Implications for artificial intelligence–based hiring

Johann D. Gaebler, Sharad Goel, Aziz Huq, and Prasanna Tambe

Behavioral Science & Policy, 2025.

DOI: 10.1177/23794607251320229.

Abstract

Rapid advances in artificial intelligence (AI), including large language models (LLMs) with abilities that rival those of human experts on a wide array of tasks, are reshaping how people make important decisions. At the same time, critics worry that LLMs may inadvertently discriminate against some groups. To address these concerns, recent regulations call for auditing the LLMs used in important decisions such as hiring. But neither current regulations nor the scientific literature offers clear guidance on how to conduct these audits. In this article, we propose and investigate one approach for auditing algorithms: correspondence experiments, a widely applied tool for detecting bias in human judgments. We applied this method to a range of LLMs instructed to rate job candidates using a novel data set of job applications for K-12 teaching positions in a large American public school district. By altering the application materials to imply that candidates are members of specific demographic groups, we measured the extent to which race and gender influenced the LLMs’ ratings of the candidates’ suitability. We found moderate race and gender disparities, with the models slightly favoring women and non-White candidates. This pattern persisted across several variations in our experiment. It is unclear what might be driving these disparities, but we hypothesize that they stem from posttraining efforts, which are part of the LLM training process and intended to correct biases in these models. We conclude by discussing the limitations of correspondence experiments for auditing algorithms.