Table of Links
Abstract and 1. Introduction
- Methods
- Quantitative Results and Creativity Support Index
- Qualitative Results from Focus Group Discussions
- Discussion
- Mitigations and Conclusion and Acknowledgments
- Ethical Guidance References
A. Related Work on Computational Humour, AI and Comedy
B. Participant Questionnaire
C. Focus
2 METHODS
Our study was designed to address a challenging problem: on the one hand, the limitations of LLMs (stereotypes, inability to distinguish comedic offensiveness from harmful speech, cultural erasure and homogenisation of content); on the other, the use of LLMs for a creative writing task. For this reason, we consulted a group of experts, professional comedians and performers, who are accustomed both to thinking about thorny questions of identity, offensiveness and censorship in their work, and to employing language in a highly creative way. We chose artists who already use AI in their work and expected them to be somewhat knowledgeable about, and open to, using AI; this likely biased our results[3].
We ran workshops with 20 comedians who use AI creatively. The first workshop, with 10 participants, was run in person at Edinburgh Festival Fringe 2023; the following 3 workshops, with 3, 4 and 3 participants, were run online. We reached out to comedians performing in Edinburgh during the Fringe, or in our network, and attempted to recruit as diverse (along linguistic, cultural, gender, sexual, national and racial dimensions) a pool of comedians as possible given the constraints of the study[4]. Participants had contrasting views on AI for comedy writing, from “AI is very bad at this, and I don’t want to live in a world where it gets better” (p15) to “I liked the details that I got. I think those details sparked my imagination, and I think I could use them to write something” (p20). Participants were asked to register on the Prolific platform[5] and were invited to join a specific study via an allowlist. The study was approved by the research ethics committee of our institution. The information sheet and consent forms were shared with the participants, their active consent was obtained at the beginning of the workshop, and they had the right to withdraw without prejudice at any time. The Prolific platform handled the payment of their participation fee, set at £300 for the 3-hour session.
We started each 3-hour session by describing the agenda and goals of the workshop, sharing the information sheet and consent forms with the participants, and asking them to start filling out a short anonymous survey about their background in comedy, previous exposure to AI and usage of AI in performance (full questionnaire in Appendix B.1).
2.1 Writing exercise
We then proceeded with a comedy-writing exercise, in which participants spent around 45 minutes on their own, using an LLM. We encouraged participants to try to use the LLM in a way that would generate useful material “that they would be comfortable presenting in a comedy context”, but emphasized that we did not require a fully finished product by the end of the writing exercise. We invited them to use the language(s) they felt most comfortable with[6]. We also suggested they could use the tool to 1) generate, rate/detect or explain jokes, 2) co-write jokes via iterative prompting, step-by-step or using examples, and 3) analyse, rewrite or complete some of their previous material. In the first workshop (in person), we provided participants with access to ChatGPT-3.5 [79], served via a plain-text interface similar to ChatGPT. In the following 3 workshops, we invited participants to use their own preferred model via their personal account: participants used ChatGPT-3.5, ChatGPT-4 [77] and Google Bard powered by Gemini Pro [98] (December 2023 version). Note that the choice of such instruction-tuned models was motivated by their popularity and ease of access for comedians; more complex prompting strategies, such as those used in Dramatron [71], could have produced higher-quality outputs.
2.2 Creativity Support Tools evaluation
Following the writing exercise, we asked participants to fill out three surveys. The first survey was about their experience with the AI system for writing comedy material and contained nine questions from previous studies [53, 71, 114] that assessed LLMs for creative writing on a 5-level Likert scale (see Appendix B.2). The second survey was used to calculate the Creativity Support Index (CSI) [25] of the writing tool, a measure which was itself adapted from the NASA Task Load Index [43]. The CSI is estimated via a psychometric survey that measures six dimensions of creativity support: Exploration, Expressiveness, Immersion, Enjoyment, Results Worth Effort, and Collaboration (see specific questions in Appendix B.3). The resulting score is a number between 0 and 100, where 90 is considered excellent and 50 mediocre. The third survey contained free-form questions on one thing that the “AI system” (the LLM writing tool) did well, one improvement, and open-ended comments on the writing session and on the survey.
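As an illustration, the CSI aggregation described above can be sketched as follows. This is a minimal sketch following Cherry and Latulipe's published formulation; the 0-10 agreement scale, the two-statements-per-factor structure, and the 15 paired-factor comparisons are properties of that instrument as we understand it, not details reported in this paper.

```python
# Sketch of a Creativity Support Index (CSI) computation.
# Assumptions: each of the six factors receives two agreement ratings on a
# 0-10 scale, plus a count (0-5) of how many times it was preferred across
# the 15 paired-factor comparisons, which serves as its weight.

FACTORS = [
    "Exploration", "Expressiveness", "Immersion",
    "Enjoyment", "Results Worth Effort", "Collaboration",
]

def csi_score(agreement: dict, pair_counts: dict) -> float:
    """Compute a CSI score in [0, 100].

    agreement:   factor -> (rating1, rating2), each rating in 0..10
    pair_counts: factor -> number of paired comparisons won (0..5);
                 counts must sum to 15 (one per pair of factors).
    """
    assert sum(pair_counts.values()) == 15, "expected 15 paired comparisons"
    # Weight each factor's summed agreement by its paired-comparison count.
    weighted = sum(sum(agreement[f]) * pair_counts[f] for f in FACTORS)
    # Maximum is (10 + 10) * 15 / 3 = 100, giving the 0-100 range.
    return weighted / 3.0
```

For example, a respondent who rates every factor 10/10 on both statements obtains a CSI of 100 regardless of how the 15 pairwise-comparison counts are distributed, since the weights then sum out of the expression.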
2.3 Focus Group Questions
In order to guide the discussion, we prepared two sets of questions[7] (see Appendix B.4 for the full list of questions). The first set of questions pertained to the usefulness of the outputs generated by the LLM tool for personal writing, differences between using an LLM or searching for inspiration using Wikipedia or a search engine, the types of comedy that can be produced by an LLM, and concerns about the ownership of LLM-generated outputs.
The second set of questions addressed the comedy writing process of the participants, as well as the topics introduced in Section 1.2, namely the various biases and stereotypes of LLMs, problems with the moderation strategies employed by LLMs, the importance of context and delivery, and whether some forms of cultural appropriation or homogenisation could occur. We invited discussions about the use of other comedians’ work, and also challenged the participants with questions on whether the AI has a “voice” and whether humour can be quantified.
2.4 Focus Group Analysis
In our workshops, we followed the focus group methodology described in [74, 76]: engaging a group of participants in an informal one-hour discussion focused around a particular topic, activity, or stimulus material, with a team of two moderators. Focus groups were captured as audio recordings, then automatically transcribed using speech recognition tools in Google Meet, and the transcripts were manually verified and compared against notes taken by the moderators. After transcription, the audio and video recordings were destroyed. As in the surveys, participants were anonymised: authors independently reviewed the transcripts to remove any personally identifiable information. We then performed constant comparison analysis on the transcripts of the focus groups [76]. We first identified initial codes using sentence-by-sentence open coding. We then grouped those codes into themes, and found themes that were coherent across focus groups. Data from four focus groups allowed us to achieve data saturation [17, 63].
Section 3 summarises the quantitative results[8] derived from the Creativity Support Tool evaluation (Sect. 2.2), while Section 4 details the observations made by the participants during the focus groups (Sect. 2.4). Please note that this paper is an exploration of external perspectives rather than an endorsement of any one of them; in particular, this paper does not seek to undertake any legal evaluation.
:::info
Authors:
(1) Piotr W. Mirowski∗, Google DeepMind London, UK ([email protected]);
(2) Juliette Love∗, Google DeepMind London, UK ([email protected]);
(3) Kory Mathewson, Google DeepMind Montréal, QC, Canada ([email protected]);
(4) Shakir Mohamed, Google DeepMind London, UK ([email protected]).
:::
:::info
This paper is available on arxiv under CC BY 4.0 license.
:::
[3] Our biased selection criteria for participants might, and likely do, lead to biased opinions compared with the much broader population of comedians and performers, which might be reflected in a more favourable judgment of the Creativity Support Index of LLM writing tools. Future research might explore the diversity of opinions in creative communities across a greater range of familiarity with AI tools and openness to using them in their own creative practices. Exploring those opinions would significantly increase the scope of the paper and would make a compelling follow-up study.
[4] A demographic analysis of opinions might be a possible avenue for future investigations, but it would require a different study design and participant recruiting process.
[5] https://prolific.com
[6] Languages included German, Dutch, English, French, Hindi, Swedish and Tamil.
[7] Question-led focus groups are useful to start discussions, but we acknowledge the limitation that questions can bias the participants’ responses.
[8] Full outputs of the writing sessions, all individual survey results and raw transcripts from the focus groups will be shared in anonymised form as supplementary material, once our work is published.
