Authors:
(1) Evan Shieh, Young Data Scientists League ([email protected]);
(2) Faye-Marie Vassel, Stanford University;
(3) Cassidy Sugimoto, School of Public Policy, Georgia Institute of Technology;
(4) Thema Monroe-White, Schar School of Policy and Government & Department of Computer Science, George Mason University ([email protected]).
Table of Links
Abstract and 1 Introduction
1.1 Related Work and Contributions
2 Methods and Data Collection
2.1 Textual Identity Proxies and Socio-Psychological Harms
2.2 Modeling Gender, Sexual Orientation, and Race
3 Analysis
3.1 Harms of Omission
3.2 Harms of Subordination
3.3 Harms of Stereotyping
4 Discussion, Acknowledgements, and References
SUPPLEMENTAL MATERIALS
A OPERATIONALIZING POWER AND INTERSECTIONALITY
B EXTENDED TECHNICAL DETAILS
B.1 Modeling Gender and Sexual Orientation
B.2 Modeling Race
B.3 Automated Data Mining of Textual Cues
B.4 Representation Ratio
B.5 Subordination Ratio
B.6 Median Racialized Subordination Ratio
B.7 Extended Cues for Stereotype Analysis
B.8 Statistical Methods
C ADDITIONAL EXAMPLES
C.1 Most Common Names Generated by LM per Race
C.2 Additional Selected Examples of Full Synthetic Texts
D DATASHEET AND PUBLIC USE DISCLOSURES
D.1 Datasheet for Laissez-Faire Prompts Dataset
B.3 Automated Data Mining of Textual Cues
To measure harms of omission (see Supplemental B.4), we collect 1,000 generations per language model per prompt to produce an adequate number of total samples for modeling “small-N” populations [35]. On the resulting dataset of 500K stories, it is intractable to hand-extract textual cues by reading each individual story. Therefore, we fine-tune a language model (gpt-3.5-turbo) to perform automated extraction of gender references and names at high precision.
First, we hand-label inferred gender (based on gender references) and names on an evaluation set of 4,600 uniformly down-sampled story generations from all five models, ensuring that all three domains and both power conditions are equally represented. This provides a sample dataset from which to estimate precision and recall statistics on all 500K stories with high confidence (95% CI: 0.0063).
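The down-sampling itself is a stratified draw. Below is a minimal sketch of one way to implement it, assuming the 500K generations live in a pandas DataFrame with hypothetical model, domain, and power_condition columns (these names are ours, not the paper's):

```python
# Illustrative sketch (not the authors' released code): uniform stratified
# down-sampling of story generations for the hand-labeled evaluation set.
# Column names ("model", "domain", "power_condition") are assumed.
import pandas as pd

def sample_evaluation_set(stories: pd.DataFrame, total: int = 4600,
                          seed: int = 0) -> pd.DataFrame:
    """Draw an equal number of stories per (model, domain, power) stratum."""
    strata = ["model", "domain", "power_condition"]
    n_strata = stories.groupby(strata).ngroups        # e.g., 5 x 3 x 2 = 30
    per_stratum = total // n_strata
    return (stories
            .groupby(strata, group_keys=False)
            .apply(lambda g: g.sample(per_stratum, random_state=seed)))
```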
Then, we use ChatGPT 3.5 (gpt-3.5-turbo) to perform automated labeling with the prompt templates shown in Table S7, chosen after iterating through candidate prompts and selecting based on precision and recall. Based on the scenarios and power conditions for each specific story prompt (see Supplement A, Tables S3, S4, and S5), we adjust the “Character” placeholder variable(s) in the prompt template.
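A sketch of what a single labeling call might look like, assuming the OpenAI chat completions client; the template text below is a stand-in for illustration only, since the actual templates are those in Table S7:

```python
# Illustrative labeling call. TEMPLATE is a placeholder, not the published
# prompt template; the "Character" placeholder is filled per story prompt.
from openai import OpenAI

client = OpenAI()

TEMPLATE = (
    "Story: {story}\n\n"
    "For each character ({characters}), return a JSON object with the "
    "gender references (pronouns) and the name used for that character."
)

def label_story(story: str, characters: list[str]) -> str:
    prompt = TEMPLATE.format(story=story, characters=", ".join(characters))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # later swapped for the fine-tuned model ID
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```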
For each label response we receive, we parse the returned JSON and apply programmatic post-processing to remove hallucinations (such as references or names that do not appear in the story texts). We report the results of this initial process in Table S8a.
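The post-processing step can be sketched as follows; the JSON field structure is an assumption for illustration, not the schema used in the paper:

```python
# Sketch of post-processing: parse the model's JSON reply and drop any
# extracted names or gender references that never appear in the story text
# (i.e., hallucinations). Field layout {character: {field: value}} is assumed.
import json

def postprocess(raw_label: str, story: str) -> dict | None:
    try:
        labels = json.loads(raw_label)
    except json.JSONDecodeError:
        return None                      # malformed reply; re-queue or skip
    cleaned = {}
    for character, fields in labels.items():
        if not isinstance(fields, dict):
            continue                     # unexpected structure; discard
        kept = {key: value for key, value in fields.items()
                if isinstance(value, str) and value.lower() in story.lower()}
        if kept:
            cleaned[character] = kept
    return cleaned
```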
We observe results in line with prior studies of co-reference resolution showing that automated systems underperform on minoritized identity groups [58]. For example, the pre-trained gpt-3.5-turbo model does not perform well on non-binary pronouns such as they/them, often struggling to distinguish whether a pronoun resolves to an individual character or to a group.
To address such issues, we further hand-label 150 stories (outside of the evaluation dataset), focusing on cases the initial model struggled with, including non-binary pronouns in the Love domain. This boosts our precision to above 98% for both gender references and names, as shown in Table S8b. Final recall reaches 97% for gender references and above 99% for names.
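The fine-tuning submissions described above could be reproduced along these lines, assuming the OpenAI fine-tuning endpoints and a JSONL file of (labeling prompt, hand-corrected label) chat pairs; the filename is hypothetical:

```python
# Sketch of a fine-tuning submission (assumes the OpenAI fine-tuning API;
# the filename is hypothetical). Each JSONL line is a chat pair:
#   {"messages": [{"role": "user", "content": <labeling prompt>},
#                 {"role": "assistant", "content": <hand-corrected JSON label>}]}
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("hand_labeled_stories.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)  # poll this job; the finished job's fine_tuned_model is then used for labeling
```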
We note that fine-tuning a closed-source model such as ChatGPT has potential drawbacks, including a lack of visibility into changes to the underlying model. Additionally, at the time of this writing, OpenAI has not released detailed information on the algorithms it uses for fine-tuning. For future work, the choice of model need not be restricted to ChatGPT, and open-source alternatives may work just as well.
B.4 Representation Ratio
Using observed race and gender, we quantify statistical ratios corresponding to harms of omission and subordination. For a given demographic, we define the representation ratio as the proportion p of characters observed with that demographic divided by its proportion p* in a comparison distribution.
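In symbols (our notation; see the paper's tables for the exact quantities used), the representation ratio for a demographic group d is:

```latex
% Representation ratio for demographic group d (notation ours):
%   p_d  : observed share of generated characters with demographic d
%   p*_d : share of d in the comparison distribution (here, the U.S. Census)
\[
  \mathrm{RR}_d \;=\; \frac{p_d}{p^{*}_d},
  \qquad
  p_d \;=\; \frac{\#\{\text{characters observed with demographic } d\}}{\#\{\text{characters total}\}} .
\]
```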
The choice of comparison distribution p* varies depending on the desired context of study. For example, it could be set to subject- or occupation-specific percentages (see Tables S1 and S2). Given prior research observing how definitions of “fairness” may obscure systemic challenges faced by intersectional minoritized groups [37], we focus instead on measuring the relative degree to which our demographics of study are omitted or over-represented beyond the sociological factors that already shape demographic composition to be unequal. Therefore, we set p* in our study to the U.S. Census [83, 85], while noting that more progressive ideals of fairness (e.g., uniformly over-representing under-served groups) cannot be achieved without surpassing Census representation, a lower standard.
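A short sketch of the computation, using placeholder baseline shares rather than the actual Census/HPS figures:

```python
# Sketch of the representation-ratio computation. Baseline shares below are
# placeholders for illustration, not the Census/HPS figures used in the paper.
from collections import Counter

def representation_ratios(observed: list[str],
                          baseline: dict[str, float]) -> dict[str, float]:
    """Ratio < 1 indicates omission relative to the comparison distribution."""
    counts = Counter(observed)
    total = sum(counts.values())
    return {group: (counts[group] / total) / p_star
            for group, p_star in baseline.items()}

# Example with hypothetical baseline shares (p*):
baseline = {"Group A": 0.60, "Group B": 0.19, "Group C": 0.13}
characters = ["Group A"] * 80 + ["Group B"] * 12 + ["Group C"] * 8
print(representation_ratios(characters, baseline))
# ~ {'Group A': 1.33, 'Group B': 0.63, 'Group C': 0.62}
```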
Six of the seven racial categories are assigned a likelihood in the 2022 Census [83]; MENA is excluded because it was only proposed by the OMB in 2023. We therefore baseline MENA using its overall representation in the Wikipedia dataset [57]. To compute p* for sexual orientation and gender identity (SOGI), we utilize the U.S. Census 2021 Household Pulse Survey (HPS) [85], which studies have shown reduces known issues of undercounting LGBTQ+ identities [60]. See Table S9 for how we map SOGI to our gender and relationship-type schema.