Authors:
(1) Evan Shieh, Young Data Scientists League ([email protected]);
(2) Faye-Marie Vassel, Stanford University;
(3) Cassidy Sugimoto, School of Public Policy, Georgia Institute of Technology;
(4) Thema Monroe-White, Schar School of Policy and Government & Department of Computer Science, George Mason University ([email protected]).
Table of Links
Abstract and 1 Introduction
1.1 Related Work and Contributions
2 Methods and Data Collection
2.1 Textual Identity Proxies and Socio-Psychological Harms
2.2 Modeling Gender, Sexual Orientation, and Race
3 Analysis
3.1 Harms of Omission
3.2 Harms of Subordination
3.3 Harms of Stereotyping
4 Discussion, Acknowledgements, and References
SUPPLEMENTAL MATERIALS
A OPERATIONALIZING POWER AND INTERSECTIONALITY
B EXTENDED TECHNICAL DETAILS
B.1 Modeling Gender and Sexual Orientation
B.2 Modeling Race
B.3 Automated Data Mining of Textual Cues
B.4 Representation Ratio
B.5 Subordination Ratio
B.6 Median Racialized Subordination Ratio
B.7 Extended Cues for Stereotype Analysis
B.8 Statistical Methods
C ADDITIONAL EXAMPLES
C.1 Most Common Names Generated by LM per Race
C.2 Additional Selected Examples of Full Synthetic Texts
D DATASHEET AND PUBLIC USE DISCLOSURES
D.1 Datasheet for Laissez-Faire Prompts Dataset
D DATASHEET AND PUBLIC USE DISCLOSURES
D.1 Datasheet for Laissez-Faire Prompts Dataset
Following guidance from Gebru et al. [79], we document our Laissez-Faire Prompts Dataset (construction details are described above) using a Datasheet.
D.1.1 Motivation
1. For what purpose was the dataset created?
We created this dataset to study biases in model responses to open-ended prompts that describe everyday usage, including students interfacing with language-model-based writing assistants and screenwriters or authors using generative language models to assist in fictional writing.
2. Who created the dataset (for example, which team, research group) and on behalf of which entity (for example, company, institution, organization)?
Evan Shieh created the dataset for the sole purpose of this research project.
3. Who funded the creation of the dataset?
The creation of the dataset was personally funded by the authors.
4. Any other comments?
This dataset primarily studies the context of life in the United States, although we believe that many of the principles used in its construction can be adapted to other nations and societies globally. This dataset provides a starting point for the analysis of generative language models. We use the term generative language model over the popularized alternative "large language model" (or "LLM") for multiple reasons. First, we believe that "large" is a subjective term with no clear scientific standard, used in much the same way as "big" in "big data". An example highlighting this is Microsoft's marketing material describing their model Phi as a "small language model" despite its 2.7 billion parameters [80], a number that other developers might have described as "large" just five years ago [81]. Second, we prefer to describe the models we study as "generative" to highlight the feature that this dataset assesses, namely the capability of such models to produce synthetic text. This contrasts with non-generative uses of language models such as "text embedding", the mapping of written expressions (characters, words, and/or sentences) to mathematical vector representations through algorithms such as word2vec [82].
D.1.2 Composition
5. What do the instances that comprise the dataset represent (for example, documents, photos, people, countries)?
The instances comprising the dataset represent (1) synthetic texts generated by five generative language models (ChatGPT 3.5, ChatGPT 4, Claude 2.0, Llama 2 (7B chat), and PaLM 2) in response to open-ended prompts listed in Tables S3, S4, and S5 in addition to (2) co-reference labels for gender references and names of the fictional characters represented in each synthetic text, extracted directly from the synthetic text.
6. How many instances are there in total (of each type, if appropriate)?
There are 500,000 instances in total, or 100K per model. Each model's 100K instances can be further subdivided into 50K responses to power-neutral prompts and 50K responses to power-laden prompts, each of which in turn contains 15K Learning prompts, 15K Labor prompts, and 20K Love prompts.
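The total decomposes as: 5 models × 2 power dynamics × (15K Learning + 15K Labor + 20K Love prompts) = 5 × 2 × 50K = 500K instances.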
7. Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
The dataset contains all instances we collected from the generative language models used in this study; it is not a sample from a larger set.
8. What data does each instance consist of?
Each instance consists of the following fields (an illustrative schema sketch in code follows the list):
Model: Which language model generated the text
Time: Time of text generation
Domain: Domain for the prompt (Learning, Labor, or Love)
Power Dynamic: Power-Neutral or Power-Laden
Subject: Character described in prompt (e.g. actor, star student)
Object: Secondary character, if applicable (e.g. loyal fan, struggling student)
Query: Prompt given to language model
Response: Synthetic text in response to Query from the generative language model
Label Query: Prompt used for autolabeling the Response
Label Response: Synthetic text in response to Label Query from the fine-tuned labeling model
Subject References: Extracted gender references to the Subject character
Object References: Extracted gender references to the Object character, if applicable
Subject Name: Extracted name of the Subject character (“Unspecified” or blank means no name found)
Object Name: Extracted name of the Object character, if applicable (“Unspecified” or blank means no name found)
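For illustration only, here is a minimal sketch of how one instance could be represented in code, assuming Python and using field names that mirror the list above; the actual on-disk format is not specified in this datasheet.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LaissezFairePromptInstance:
    """One dataset instance; field names mirror the schema listed above."""
    model: str                # e.g. "ChatGPT 3.5", "Claude 2.0"
    time: str                 # time of text generation
    domain: str               # "Learning", "Labor", or "Love"
    power_dynamic: str        # "Power-Neutral" or "Power-Laden"
    subject: str              # character described in the prompt, e.g. "star student"
    object: Optional[str]     # secondary character, if applicable
    query: str                # prompt given to the language model
    response: str             # synthetic text returned by the model
    label_query: str          # prompt used for autolabeling the response
    label_response: str       # output of the fine-tuned labeling model
    subject_references: str   # extracted gender references to the Subject character
    object_references: Optional[str]  # extracted gender references to the Object character
    subject_name: Optional[str]       # extracted name ("Unspecified" or blank = none found)
    object_name: Optional[str]        # extracted name, if applicable
```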
9. Is there a label or target associated with each instance?
None except for the extracted gender references and extracted names, which are hand-labeled in 4,600 evaluation examples.
10. Is any information missing from individual instances?
Yes, when LMs return responses containing only whitespace, which we observe in some Llama 2 instances.
11. Are relationships between individual instances made explicit (for example, users’ movie ratings, social network links)?
No, each individual instance is self-contained.
12. Are there recommended data splits (for example, training, development/validation, testing)?
No.
13. Are there any errors, sources of noise, or redundancies in the dataset?
In extracted gender references / names, we estimate a precision error of < 2% and recall error of < 3%.
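As a rough sketch of how such precision and recall error estimates could be computed against the 4,600 hand-labeled evaluation examples (the function and example values below are illustrative assumptions, not the authors' exact evaluation code):

```python
def precision_recall_error(extracted: set, gold: set):
    """Return (precision error, recall error) for one instance's extracted references."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return 1.0 - precision, 1.0 - recall

# Hypothetical example: one spurious extraction and one missed gold reference.
p_err, r_err = precision_recall_error({"she", "her", "Maria"}, {"she", "her", "Elena"})
print(p_err, r_err)  # ~0.33, ~0.33
```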
14. Is the dataset self-contained, or does it link to or otherwise rely on external resources (for example, websites, tweets, other datasets)?
The dataset is self-contained, but for our study we rely on external resources, including datasets containing real-world individuals with self-identified race by first name, which we use for modeling racial associations to names. We do not release linkages to these datasets in the interest of preserving privacy.
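For context, one way a name-to-race association could be derived from such an external resource is sketched below. This is an illustration with hypothetical column names and made-up counts, not the authors' actual procedure (which is described in Supplement B.2):

```python
import pandas as pd

# Hypothetical external dataset: counts of self-identified race per first name.
names = pd.DataFrame({
    "first_name": ["maria", "maria", "darnell"],
    "race":       ["Race A", "Race B", "Race C"],
    "count":      [900, 100, 800],
})

# Convert raw counts into conditional probabilities P(race | first name).
totals = names.groupby("first_name")["count"].transform("sum")
names["p_race_given_name"] = names["count"] / totals
print(names)
```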
15. Does the dataset contain data that might be considered confidential (for example, data that is protected by legal privilege or by doctor–patient confidentiality, data that includes the content of individuals’ non-public communications)?
No.
16. Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
Yes, including the stereotyping harms we describe in this paper. While we are releasing our dataset for audit transparency and in the hopes of furthering responsible AI research, we disclose that reading our dataset may be triggering and upsetting due to the adverse content it contains. Furthermore, some studies suggest that warning readers that LMs may generate biased outputs may increase anticipatory anxiety while having mixed results on actually dissuading readers from engaging [77]. We hope that this risk will be outweighed by the benefits of protecting susceptible consumers from otherwise subliminal harms.
17. Does the dataset identify any subpopulations (for example, by age, gender)?
No subpopulations of real-world individuals are identified in this dataset.
18. Is it possible to identify individuals (that is, one or more natural persons), either directly or indirectly (that is, in combination with other data) from the dataset?
Not that we are aware of, as all data included is synthetic text generated from language models. However, since the public is not fully aware of what data or annotations are used in the training processes for the models we study, we cannot guarantee against the possibility of leaked personally identifiable information.
19. Does the dataset contain data that might be considered sensitive in any way (for example, data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)?
Not for real individuals. Our dataset extracts gender references and names for synthetically generated characters.
20. Any other comments?
Researchers interested in reproducing our study who require access to the data mentioned in question 14 should follow the instructions listed in the papers by the authors we cite.
D.1.3 Collection Process
21. How was the data associated with each instance acquired? Was the data directly observable (for example, raw text, movie ratings), reported by subjects (for example, survey responses), or indirectly inferred/ derived from other data (for example, part-of-speech tags, model-based guesses for age or language)?
The data in each instance was acquired through prompting generative language models for audit purposes.
22. What mechanisms or procedures were used to collect the data (for example, hardware apparatuses or sensors, manual human curation, software programs, software APIs)?
For ChatGPT 3.5, ChatGPT 4, Claude 2.0, and PaLM 2, we used software APIs in combination with texts pulled directly from the online user interface (specifically, 10K of the 100K instances for Claude 2.0). For Llama 2 (7B), we deployed the model on Google Colaboratory instances using HuggingFace software libraries.
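For readers interested in the Llama 2 (7B chat) setup, a minimal sketch of the kind of HuggingFace-based generation call that can be run on a Google Colaboratory instance is shown below; the model identifier, prompt, and generation parameters are illustrative assumptions, not the exact settings used during collection:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated model; requires HuggingFace access approval
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs the accelerate library

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "Write a story about a star student, 100 words or less."  # illustrative prompt only
output = generator(prompt, max_new_tokens=256, do_sample=True, return_full_text=False)
print(output[0]["generated_text"])
```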
23. If the dataset is a sample from a larger set, what was the sampling strategy (for example, deterministic, probabilistic with specific sampling probabilities)?
N/A.
24. Who was involved in the data collection process (for example, students, crowdworkers, contractors) and how were they compensated (for example, how much were crowdworkers paid)?
Only the authors of the study were involved in the data labeling process. For data collection, we paid a student intern a total of $16,000 at a rate of $45 per hour (this total also covered other duties unrelated to the paper).
25. Over what timeframe was the data collected?
Data collection was conducted from August 16th to November 7th, 2023.
26. Were any ethical review processes conducted (for example, by an institutional review board)?
No, as no human subjects were involved.
27. Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (for example, websites)?
N/A – no human subjects involved.
28. Were the individuals in question notified about the data collection?
N/A – no human subjects involved.
29. Did the individuals in question consent to the collection and use of their data?
N/A – no human subjects involved.
30. If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses?
N/A – no human subjects involved.
31. Has an analysis of the potential impact of the dataset and its use on data subjects (for example, a data protection impact analysis) been conducted?
N/A – no human subjects involved.
32. Any other comments?
No.
D.1.4 Preprocessing / Cleaning / Labeling
33. Was any preprocessing/cleaning/labeling of the data done (for example, discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
Yes, we trimmed whitespace from the synthetic text generations.
34. Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (for example, to support unanticipated future uses)?
Yes – this can be made available upon request to the corresponding authors.
35. Is the software that was used to preprocess/clean/label the data available?
Yes – we are open sourcing this as part of our data as well.
36. Any other comments?
No.
D.1.5 Uses
37. Has the dataset been used for any tasks already?
Only for this study so far.
38. Is there a repository that links to any or all papers or systems that use the dataset?
Not currently, although we request that any researchers who want to access this dataset provide such information.
39. What (other) tasks could the dataset be used for?
This dataset can be used for (1) additional auditing studies, and (2) training co-reference resolution models that perform well specifically on text similar to what we study in our paper (i.e., English text of 100 words or less, generated from similar prompts).
40. Is there anything about the composition of the dataset or the way it was collected and preprocessed/ cleaned/labeled that might impact future uses?
Yes, the labeled gender references are built from the word lists we provide in Table S6, which we acknowledge is not a complete schema. This will need to be extended or modified to account for additional gender identities of interest.
41. Are there tasks for which the dataset should not be used?
We condemn the use of our dataset in any system that targets, harasses, harms, or otherwise discriminates against real-world individuals inhabiting minoritized gender, race, and sexual orientation identities, including through the harms we study in this paper. One disturbing recent abuse of automated models is illuminated by the 2020 civil lawsuit National Coalition on Black Civic Participation v. Wohl [78], which describes how a group of defendants used automated robocalls to target and attempt to intimidate tens of thousands of Black voters ahead of the November 2020 US election. To mitigate the risk of our models being used in such a system, we do not release our trained models for co-reference resolution, and we will ensure that any open-source access to our dataset is mediated by repositories that require researchers to document their use cases before receiving access.
42. Any other comments?
No.
D.1.6 Distribution
43. Will the dataset be distributed to third parties outside of the entity (for example, company, institution, organization) on behalf of which the dataset was created?
Yes, the dataset will be made publicly available.
44. How will the dataset be distributed (for example, tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?
The dataset will be distributed through a website provider with functionality that requires accessing users to contact the authors and state the purpose of usage before access is granted. No DOI has been assigned at the time of this writing.
45. When will the dataset be distributed?
Upon publication.
46. Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?
Yes, we will provide a ToU in addition to linking to the ToU of the developers of the five language models we study.
47. Have any third parties imposed IP-based or other restrictions on the data associated with the instances?
Yes, the developers of the language models we study.
48. Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?
No.
49. Any other comments?
No.
D.1.7 Maintenance
50. Who will be supporting/hosting/maintaining the dataset?
The first corresponding author will be maintaining the dataset.
51. How can the owner/curator/ manager of the dataset be contacted (for example, email address)?
Please contact us directly through Harvard Dataverse: https://doi.org/10.7910/DVN/WF8PJD.
52. Is there an erratum?
One will be started and maintained as part of our distribution process.
53. Will the dataset be updated (for example, to correct labeling errors, add new instances, delete instances)?
Yes, to correct labeling errors.
54. If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (for example, were the individuals in question told that their data would be retained for a fixed period of time and then deleted)?
N/A – no human subjects or relationships involved.
55. Will older versions of the dataset continue to be supported/hosted/ maintained?
Yes, the dataset will be versioned.
56. If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?
There is no formal mechanism; we ask any interested individuals to contact us on a case-by-case basis.
57. Any other comments?
No.