Authors:
(1) Arkadiy Saakyan, Columbia University ([email protected]);
(2) Shreyas Kulkarni, Columbia University;
(3) Tuhin Chakrabarty, Columbia University;
(4) Smaranda Muresan, Columbia University.
Editor’s note: this is part 2 of 6 of a study looking at how well large AI models handle figurative language. Read the rest below.
Textual entailment (MacCartney and Manning, 2008; Bowman et al., 2015) and visual entailment (Xie et al., 2019) tasks have been proposed to measure language and multimodal understanding. However, models trained simply to improve label accuracy on these datasets can be brittle and suffer from spurious correlations (Poliak et al., 2018; Gururangan et al., 2018; McCoy et al., 2019; Gardner et al., 2021). Datasets such as e-SNLI (Camburu et al., 2018) and e-SNLI-VE (Kayser et al., 2021) augment existing entailment datasets with natural language explanations and train models not only to predict the label, but also to generate a textual explanation for the reason behind the prediction. This approach has since been adopted for a variety of tasks, such as commonsense reasoning (Rajani et al., 2019; Aggarwal et al., 2021) and social norm understanding (CH-Wang et al., 2023), among others (Wiegreffe and Marasovic, 2021). It has also been extended to assess LLMs’ ability to understand figurative language through the FLUTE dataset (Chakrabarty et al., 2022), which frames figurative language understanding as an explainable textual entailment task. Recent progress in multimodal models (Li et al., 2022; Alayrac et al., 2022; OpenAI, 2023; Team et al., 2023; Liu et al., 2023b; Anthropic, 2024) prompts us to assess similar capabilities in a multimodal setting, testing the understanding of nonliteral meaning conveyed in both images and text. We present an equivalent of the FLUTE dataset for the visual modality: V-FLUTE.
A number of previous works have focused on modeling figurative phenomena beyond text. Chakrabarty et al. (2023) use a human-AI collaboration framework to generate visual metaphors from linguistic metaphors (the HAIVMet dataset) and propose a visual entailment task as an extrinsic evaluation of dataset quality. The dataset contains images, claims, and labels, but no textual explanations. Yosef et al. (2023) propose a benchmark (IRFL) where, given an idiom, metaphor, or simile, the model has to distinguish which of the four associated images implies the figurative meaning of the expression. This dataset focuses on figurative meaning in the textual modality and does not contain textual explanations. There has also been work on understanding multimodal sarcasm with explanations (Desai et al., 2022), mostly relying on noisy user-generated text and crowdworker-written explanations. Another line of work has focused on understanding humor with multimodal models. MemeCap (Hwang and Shwartz, 2023) is a dataset for understanding memes. Hessel et al. (2023) release a corpus of annotated New Yorker Caption Contest entries, where the goal is to come up with a humorous caption for an image, with high-quality explanations for why the caption is humorous. The dataset is relatively limited in size, containing only 520 unique instances in its training set. We leverage all these benchmarks to build V-FLUTE.