Authors:
(1) Tony Lee, Stanford (equal contribution);
(2) Michihiro Yasunaga, Stanford (equal contribution);
(3) Chenlin Meng, Stanford (equal contribution);
(4) Yifan Mai, Stanford;
(5) Joon Sung Park, Stanford;
(6) Agrim Gupta, Stanford;
(7) Yunzhi Zhang, Stanford;
(8) Deepak Narayanan, Microsoft;
(9) Hannah Benita Teufel, Aleph Alpha;
(10) Marco Bellagente, Aleph Alpha;
(11) Minguk Kang, POSTECH;
(12) Taesung Park, Adobe;
(13) Jure Leskovec, Stanford;
(14) Jun-Yan Zhu, CMU;
(15) Li Fei-Fei, Stanford;
(16) Jiajun Wu, Stanford;
(17) Stefano Ermon, Stanford;
(18) Percy Liang, Stanford.
C Metric details
C.1 Human metrics
We rely on human annotators to evaluate the generated images on several aspects: alignment, quality, aesthetics, and originality. For quality, we focus on photorealism. For aesthetics, we focus on subject clarity and the overall aesthetics of the generated images. The full human evaluation questions are given below.
To obtain reliable human evaluation results, we follow the crowdsourcing methodology of [35]. Concrete word descriptions are provided for each question and rating choice, and a minimum of 5 crowdsource workers evaluate each image. We use at least 100 image samples for each aspect being evaluated. For a more detailed description of the crowdsourcing procedure, see §E.
Overall alignment. We investigate whether the generated image meets the annotators’ expectations by asking them to rate how well the image matches the description using a 5-point Likert scale, similar to [35]:
How well does the image match the description?
a) Does not match at all
b) Has significant discrepancies
c) Has several minor discrepancies
d) Has a few minor discrepancies
Photorealism. While photorealism alone does not guarantee superior quality in all contexts, we include it as a measure to assess the basic competence of the text-to-image model. To evaluate photorealism, we employ the HYPE∞ metric [25], where annotators distinguish between real and model-generated images based on 200 samples, with 100 being real and 100 being model-generated. Following [35], below is the multiple-choice question asked of human annotators for both real and generated images:
Determine if the following image is AI-generated or real.
a) AI-generated photo
b) Probably an AI-generated photo, but photorealistic
c) Neutral
d) Probably a real photo, but with irregular textures and shapes
e) Real photo
Subject clarity. We assess subject clarity by evaluating whether the generated image effectively highlights the focal point, following principles commonly shared in art and visual storytelling [68]. We accomplish this by asking annotators to determine whether the subject of the image is apparent, on a 3-point Likert scale:
Is it clear who the subject(s) of the image is? The subject can be a living being (e.g., a dog or person) or an inanimate body or object (e.g., a mountain).
a) No, it’s unclear.
b) I don’t know. It’s hard to tell.
c) Yes, it’s clear.
Overall aesthetics. For the overall aesthetics, we aim to obtain a holistic assessment of the image’s appeal by asking annotators to rate its aesthetic pleasingness:
How aesthetically pleasing is the image?
a) I find the image ugly.
b) The image has a lot of flaws, but it’s not completely unappealing.
c) I find the image neither ugly nor aesthetically pleasing.
d) The image is aesthetically pleasing and nice to look at.
e) The image is aesthetically stunning. I can look at it all day.
Overall originality. We assess whether the generated images offer a unique interpretation based on the provided description, as this is valued by both creators and audiences. We achieve this by asking annotators to rate the image’s originality given the prompt:
How original is the image, given it was created with the description?
a) I’ve seen something like this before to the point it’s become tiresome.
b) The image is not really original, but it has some originality to it.
c) Neutral.
d) I find the image to be fresh and original.
e) I find the image to be extremely creative and out of this world.
C.2 Automated metrics
CLIPScore. CLIPScore [24] measures how well an image is aligned with a corresponding natural language description using the pre-trained CLIP model [73]. It is a commonly used metric for text-image alignment [6].
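To make the computation concrete, the following is a minimal sketch of CLIPScore as formulated in [24]: the cosine similarity between the CLIP image and text embeddings, clipped at zero and rescaled by a constant w = 2.5. The sketch assumes the embeddings have already been produced by a CLIP model. The function names are illustrative, and the source does not state whether the rescaled or the raw similarity is reported.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def clip_score(image_embedding, text_embedding, w=2.5):
    """CLIPScore sketch: w * max(cos(image, text), 0).

    The embeddings are assumed to be precomputed by CLIP (not done here).
    """
    return w * max(cosine_similarity(image_embedding, text_embedding), 0.0)
```

Under this formulation the score ranges from 0 (orthogonal or opposed embeddings) to w (identical directions).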
Fréchet Inception Distance (FID). Fréchet Inception Distance (FID) [23] is a benchmark metric used for evaluating the quality of images generated by models [6, 4, 74]. It quantifies how similar the generated images are to reference images, measured by the Inception Net [75].
To compute the FID, we randomly selected 30,000 text prompts from MS-COCO and generated a single image for each prompt using the text-to-image generation model that we are evaluating. Then, we resized the images to 512×512 and used [76] to compute the FID between the set of real images associated with the prompts and the set of generated images.
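For reference, the standard definition of FID from [23] fits a Gaussian to the Inception features of each image set and computes the Fréchet distance between the two Gaussians. The notation below is introduced here for illustration: the subscripts r and g denote the real and generated sets, and μ and Σ denote the feature mean and covariance.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated images are statistically closer to the reference images.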
Inception score (IS). Inception score (IS) [36] is a benchmark metric used for evaluating the quality of image-based generative models [4, 77]. Following the settings of those works, we compute the IS to evaluate the image quality of the generative models, using the implementation in [78].
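As a reminder of what the score measures, the sketch below follows the standard definition of IS: the exponential of the average KL divergence between each image's predicted class distribution p(y|x) and the marginal distribution p(y) over all images. It assumes the per-image class probabilities have already been obtained from an Inception classifier. The function name is illustrative, and this is not the implementation in [78], which may differ in details such as averaging over splits.

```python
import math


def inception_score(class_probs):
    """IS sketch: exp(mean over images of KL(p(y|x) || p(y))).

    class_probs: list of per-image class-probability lists, each summing to 1.
    """
    n = len(class_probs)
    if n == 0:
        raise ValueError("need at least one image")
    num_classes = len(class_probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[k] for p in class_probs) / n for k in range(num_classes)]
    mean_kl = 0.0
    for p in class_probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        kl = sum(pk * math.log(pk / marginal[k]) for k, pk in enumerate(p) if pk > 0.0)
        mean_kl += kl / n
    return math.exp(mean_kl)
```

The score is 1 when every image receives the same prediction and grows when individual predictions are confident while the set as a whole covers many classes.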
LAION Aesthetics. We follow LAION-Aesthetics and use the open-sourced predictor [18] to evaluate the aesthetic score of a generated image.
Fractal coefficient. The fractal coefficient is a measure of the fractal patterns in an image, indicating the degree of detail and complexity present at different scales. It can be used to assess the aesthetic quality of images by quantifying their level of visual intricacy and richness [41]. We used the following code snippet to compute the metric: https://gist.github.com/viveksck/1110dfca01e4ec2c608515f0d5a5b1d1. We report the absolute difference between the fractal coefficient and 1.4, the value that is most aesthetically pleasing to humans according to [41].
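One common way to estimate such a coefficient is box counting, sketched below. This is an illustrative sketch, not the code from the linked snippet. It assumes the image has already been binarized into a square grid of 0/1 values whose side length is a power of two, and all function names are hypothetical.

```python
import math


def box_count(grid, size):
    """Count size x size boxes that contain at least one set pixel."""
    n = len(grid)
    count = 0
    for r0 in range(0, n, size):
        for c0 in range(0, n, size):
            if any(grid[r][c]
                   for r in range(r0, min(r0 + size, n))
                   for c in range(c0, min(c0 + size, n))):
                count += 1
    return count


def box_counting_dimension(grid):
    """Least-squares slope of log N(s) against log(1/s) for s = 1, 2, 4, ..."""
    n = len(grid)
    sizes = []
    s = 1
    while s <= n // 2:
        sizes.append(s)
        s *= 2
    if len(sizes) < 2:
        raise ValueError("grid must be at least 4 x 4")
    counts = [box_count(grid, s) for s in sizes]
    if counts[0] == 0:
        raise ValueError("grid has no set pixels")
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def fractal_score(grid, target=1.4):
    """Absolute difference between the estimated coefficient and the target."""
    return abs(box_counting_dimension(grid) - target)
```

As a sanity check, a fully filled grid yields a dimension of about 2 and a single diagonal line yields about 1, so both are penalized relative to the target of 1.4.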
Watermark. The unintentional generation of watermarked images poses a risk of trademark and copyright infringement. We use the watermark detector from LAION [40] to determine whether an image contains watermarks.
LAION NSFW. We use the NSFW (not safe for work) classifier from LAION [40] to determine whether an image contains inappropriate content.
NudeNet. In accordance with [8], we employ the pre-trained NudeNet [44] to determine whether an image contains nudity.
Blackout. Stable Diffusion [4] incorporates a module that post-processes images by blacking them out if they are considered inappropriate. This metric represents whether the image has been blacked out.
API rejection. The DALL-E 2 API [3] rejects requests when the provided prompt can lead to the generation of inappropriate content. This metric represents whether the image has been rejected. It pertains specifically to DALL-E 2.
Gender bias. This metric assesses the presence of binary gender bias within a set of images, following [33, 29]. Specifically, we obtain 8 output images from the model based on a given prompt. For each image, we employ CLIP to compare it against the prompts {“a photo of a male”, “a photo of a female”} and select the gender with the higher probability. The proportion of female representation is then computed among the 8 images. Finally, we calculate the L1 norm distance between this proportion and 0.5, which serves as the measure of gender bias.
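The final aggregation step can be sketched as follows. The sketch assumes that CLIP has already produced, for each image, a pair of probabilities for the two prompts. The function names and the tie-breaking rule are illustrative assumptions.

```python
def classify_gender(prob_male, prob_female):
    """Pick the label with the higher CLIP probability (ties go to "male" here)."""
    return "female" if prob_female > prob_male else "male"


def gender_bias(clip_probabilities):
    """Distance between the female proportion and 0.5.

    clip_probabilities: list of (prob_male, prob_female) pairs, one per image.
    """
    if not clip_probabilities:
        raise ValueError("need at least one image")
    labels = [classify_gender(pm, pf) for pm, pf in clip_probabilities]
    female_fraction = labels.count("female") / len(labels)
    return abs(female_fraction - 0.5)
```

With 8 images, a balanced 4/4 split gives 0, while 8 images of the same predicted gender give the maximum value of 0.5.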
Skin tone bias. This metric assesses the presence of skin tone bias within a set of images, following [33, 29]. Specifically, we obtain 8 output images from the model based on a given prompt. For each image, we identify skin pixels by analyzing the RGBA and YCrCb color spaces. These skin pixels are then compared to a set of 10 MST (Monk Skin Tone) categories, and the closest category is selected. Using the 8 images, we compute the distribution across the 10 MST skin tone categories, resulting in a vector of length 10. Finally, we calculate the L1 norm distance between this vector and a uniform distribution vector (also length 10), with each value set to 0.1. This calculated error value serves as the measure of skin tone bias.
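The distance computation can be sketched as follows. The sketch assumes each image has already been assigned the index (0 to 9) of its closest MST category. Skin-pixel detection and category matching are not shown, and the function name is illustrative.

```python
def skin_tone_bias(mst_categories, num_categories=10):
    """L1 distance between the observed MST distribution and a uniform one.

    mst_categories: list of category indices in [0, num_categories), one per image.
    """
    if not mst_categories:
        raise ValueError("need at least one image")
    counts = [0] * num_categories
    for category in mst_categories:
        counts[category] += 1
    total = len(mst_categories)
    distribution = [count / total for count in counts]
    uniform = 1.0 / num_categories
    return sum(abs(p - uniform) for p in distribution)
```

Under this sketch, 8 images cannot cover all 10 categories, so the smallest attainable value is 0.4 (8 images in 8 distinct categories), and the largest is 1.8 (all images in one category).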
Fairness. This metric, inspired by [1], assesses changes in model performance (human-rated alignment score and CLIPScore) when the prompt is varied in terms of social groups. For instance, this involves modifying male terms to female terms or incorporating African American dialect into the prompt (see MS-COCO (gender) and MS-COCO (dialect) in §B). A fair model is expected to maintain consistent performance across these variations.
Robustness. This metric, inspired by [1], assesses changes in model performance (human-rated alignment score and CLIPScore) when the prompt is perturbed in a semantic-preserving manner, such as injecting typos (see MS-COCO (typos) in §B). A robust model is expected to maintain consistent performance under such perturbations.
Multilinguality. This metric assesses changes in model performance (human-rated alignment score and CLIPScore) when the prompt is translated into non-English languages, such as Spanish, Chinese, and Hindi. We use Google Translate for the translations (see MS-COCO (languages) in §B). A multilingual model is expected to maintain consistent performance across languages.
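The fairness, robustness, and multilinguality metrics share the same shape: a score on the original prompts is compared with the score on modified prompts. The text does not specify how the change is aggregated, so the sketch below assumes a simple difference of mean scores. The function name and this aggregation are assumptions for illustration only.

```python
def performance_change(original_scores, modified_scores):
    """Mean score on modified prompts minus mean score on original prompts.

    Both inputs are lists of per-prompt scores (e.g., CLIPScore or
    human-rated alignment). The difference-of-means aggregation is an
    assumption of this sketch, not taken from the text.
    """
    if not original_scores or not modified_scores:
        raise ValueError("both score lists must be non-empty")
    original_mean = sum(original_scores) / len(original_scores)
    modified_mean = sum(modified_scores) / len(modified_scores)
    return modified_mean - original_mean
```

A value near zero indicates consistent performance, and a negative value indicates a decline on the modified prompts.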
Inference time. Using APIs introduces performance variability; for example, requests might experience queuing delay or interfere with each other. Consequently, we use two inference runtime metrics to separate out these concerns: the raw runtime and the denoised runtime, a version with this performance variance factored out [79].
Object detection. We use the ViTDet [43] object detector with a ViT-B [80] backbone and the detectron2 [81] library to automatically detect objects specified in the prompts. The object detection metrics cover three skills, similar to DALL-Eval [29]. First, we evaluate the object recognition skill by calculating the average accuracy over N test images, determining whether the object detector accurately identifies the target class from the generated images. Second, the object counting skill is assessed similarly by calculating the average accuracy over N test images and evaluating whether the object detector correctly identifies all M objects of the target class from each generated image. Lastly, the spatial relation understanding skill is evaluated based on whether the object detector correctly identifies both target object classes and the pairwise spatial relations between objects. The target class labels, object counts, and spatial relations come from the text prompts used to query the models being evaluated.
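The recognition and counting accuracies can be sketched as follows. The sketch assumes the detector output has already been reduced to a list of detected class labels per image. The function names are illustrative, the spatial relation skill is omitted because it additionally requires bounding boxes, and exact-count matching is this sketch's reading of "all M objects".

```python
def recognition_accuracy(detections, target_classes):
    """Fraction of images in which the target class is detected at least once.

    detections: list of lists of detected class labels, one list per image.
    target_classes: the target class for each image, taken from its prompt.
    """
    if not target_classes:
        raise ValueError("need at least one image")
    correct = sum(1 for labels, target in zip(detections, target_classes)
                  if target in labels)
    return correct / len(target_classes)


def counting_accuracy(detections, target_classes, target_counts):
    """Fraction of images where the detected count of the target class is exactly M."""
    if not target_classes:
        raise ValueError("need at least one image")
    correct = sum(1 for labels, target, m in zip(detections, target_classes, target_counts)
                  if labels.count(target) == m)
    return correct / len(target_classes)
```

For example, with detections `[["dog", "cat"], ["dog", "dog"], ["car"]]` and the target class "dog" for every image, the target is recognized in the first two images only.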
[18] https://laion.ai/blog/laion-aesthetics/