Can AI Explain a Joke? Not Quite — But It’s Learning Fast | HackerNoon


Authors:

(1) Arkadiy Saakyan, Columbia University ([email protected]);

(2) Shreyas Kulkarni, Columbia University;

(3) Tuhin Chakrabarty, Columbia University;

(4) Smaranda Muresan, Columbia University.

Editor’s note: This is Part 4 of 6 of a study looking at how well large AI models handle figurative language.


4 Experiments

We empirically study how several baseline models perform on the task of explainable visual entailment, investigating the performance of both off-the-shelf and fine-tuned models.

4.1 Models

We select a variety of models for our study (see taxonomy in Appendix, Figure 10). For off-the-shelf models, we explore both open and API-based models. For open models, we select the (current) state-of-the-art LLaVA-1.6 models (Liu et al., 2024). LLaVA is one of the simplest yet highest-performing VLM architectures currently available. It uses a pretrained large language model (e.g., Mistral-7B (Jiang et al., 2023)) and a vision-language cross-modal connector (e.g., an MLP layer) to align the outputs of the vision encoder (e.g., CLIP (Radford et al., 2021)) with the language model. We select LLaVA-1.6 models in their 7B and 34B configurations (LLaVA-v1.6-7B and LLaVA-v1.6-34B, respectively) and refer to them as LLaVA-ZS-7B and LLaVA-ZS-34B. Both models have been instruction-tuned on fewer than 1M visual instruction-tuning samples to act as general language and vision assistants. It should, however, be noted that these models do not currently support few-shot multimodal prompting.
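To make the setup concrete, here is a minimal zero-shot inference sketch for a LLaVA-1.6 model using Hugging Face Transformers. The checkpoint name, prompt wording, and file path are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal zero-shot LLaVA-1.6 inference sketch (assumed checkpoint and prompt, not the paper's exact setup).
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed public checkpoint
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("premise.jpg")  # hypothetical premise image
# Illustrative instruction in Mistral's [INST] template; the paper uses its own instruction paraphrases.
prompt = ("[INST] <image>\nDoes the image entail or contradict the claim "
          "'He finally spilled the beans'? Answer with a label and a short explanation. [/INST]")

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```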

In addition to zero-shot testing, we also test these models using Compositional Chain-of-Thought Prompting, proposed by Mitra et al. (2023). The method first prompts the model to generate a scene graph and then uses that scene graph in a second prompt to answer the relevant question; it works zero-shot, without requiring fine-tuning. We refer to these models as LLaVA-ZS-7B-SG and LLaVA-ZS-34B-SG for the 7B and 34B LLaVA configurations described above.
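A rough sketch of this two-stage scheme is shown below; the `generate` helper and the prompt wording are assumptions standing in for the actual VLM call and exact prompts.

```python
# Sketch of compositional scene-graph prompting (after Mitra et al., 2023).
# `generate(image, prompt)` is an assumed helper wrapping any VLM call; prompts are illustrative.
def scene_graph_prompting(generate, image, claim):
    # Stage 1: ask the VLM to summarize the image as a scene graph.
    sg_prompt = ("For the provided image, generate a scene graph in JSON format that includes "
                 "the objects, their attributes, and the relationships between them.")
    scene_graph = generate(image, sg_prompt)

    # Stage 2: condition the entailment question on the generated scene graph.
    qa_prompt = (f"Scene graph: {scene_graph}\n"
                 f"Using the image and the scene graph, does the image entail or contradict "
                 f"the claim '{claim}'? Give a label and an explanation.")
    return generate(image, qa_prompt)
```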

For API-based models, we select three widely available state-of-the-art VLMs: Claude 3 Opus (claude-3-opus-20240229) (Anthropic, 2024), GPT-4 (gpt-4-1106-vision-preview) (OpenAI, 2023), and Gemini Pro (gemini-pro-vision) (Team et al., 2023). We refer to GPT-4 as the “teacher” model, as most candidate explanations were generated with it.
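For reference, querying one of these API-based models on a single instance might look roughly like the sketch below (shown for GPT-4 with vision via the OpenAI Python client); the prompt text and image handling are assumptions, not the paper's evaluation harness.

```python
# Hedged sketch of a single API call to GPT-4 with vision; prompt and encoding are illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("premise.jpg", "rb") as f:  # hypothetical premise image
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4-1106-vision-preview",  # identifier as reported in the paper
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Does this image entail or contradict the claim 'He finally spilled the beans'? "
                     "Answer with a label and a short explanation."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```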

For fine-tuned models, we focus on fine-tuning the LLaVA-1.5-7B model (Liu et al., 2023a) (the fine-tuning code for the 1.6 models was not available at the time the paper was written). To minimize bias toward a single instruction, we fine-tune and evaluate the models on a set of 21 instruction paraphrases (see Appendix Table 8). Three model configurations are tested:

• LLaVA-eViL is a checkpoint of LLaVA-v1.5-7B further fine-tuned on the e-ViL (e-SNLI-VE) dataset for explainable visual entailment (Kayser et al., 2021), converted to the instruction format. We removed neutral-label instances, which resulted in 275,815 training instances and 10,897 validation instances.

• LLaVA-VF is the same checkpoint fine-tuned on the training set of V-FLUTE. We also fine-tune the model with a white square instead of the V-FLUTE image (denoted by −Image).

• LLaVA-eViL+VF is the same checkpoint fine-tuned on both e-ViL and V-FLUTE.

All hyperparameters are in Appendix C.
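As a rough illustration of the instruction format, a single V-FLUTE example might be converted as in the sketch below, with one of the 21 paraphrases sampled per instance; the field names and paraphrase texts are placeholders, not the released data format.

```python
# Illustrative conversion of a V-FLUTE instance to a LLaVA-style instruction example.
# Field names and paraphrases are placeholders; see Appendix Table 8 for the actual paraphrases.
import random

INSTRUCTION_PARAPHRASES = [
    "Does the image entail or contradict the claim? Explain your answer.",
    "Decide whether the claim follows from the image and justify the label.",
    # ... the remaining paraphrases
]

def to_instruction_example(example):
    return {
        "image": example["image_path"],
        "conversations": [
            {"from": "human",
             "value": f"<image>\n{random.choice(INSTRUCTION_PARAPHRASES)}\nClaim: {example['claim']}"},
            {"from": "gpt",
             "value": f"{example['label']}. {example['explanation']}"},
        ],
    }
```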

4.2 Automatic Metrics

Similarly to prior work (Chakrabarty et al., 2022), we use both the classic F1 score and an adjusted score that accounts for explanation quality: F1@ExplanationScore. The ExplanationScore is the average of BERTScore (Zhang et al., 2020), based on the microsoft-deberta-xlarge-mnli model (He et al., 2021; Williams et al., 2018), and BLEURT (Sellam et al., 2020), based on the BLEURT-20 checkpoint (Pu et al., 2021). Since our goal is to ensure models provide an answer for the right reasons, ideally we would only count predictions as correct when the explanation is also correct. Hence, we report F1@0 (simply the F1 score), F1@53 (only predictions with explanation score > 53 are considered correct), and F1@60. Thresholds are selected based on the human evaluation of explanation quality in Section 5.3.
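One way to implement the thresholded metric, assuming per-instance explanation scores have already been computed, is sketched below; this is not the authors' evaluation script, and the macro averaging is an assumption.

```python
# Sketch of F1@threshold: a prediction counts as correct only if the label is right
# and the explanation score clears the threshold. Not the authors' exact script.
from sklearn.metrics import f1_score

def f1_at_threshold(labels, preds, explanation_scores, threshold):
    # Flip predictions with weak explanations to the opposite of the gold label,
    # so they are counted as incorrect regardless of the predicted label.
    adjusted = [
        p if s > threshold else ("contradiction" if l == "entailment" else "entailment")
        for p, s, l in zip(preds, explanation_scores, labels)
    ]
    return f1_score(labels, adjusted, average="macro")  # averaging scheme assumed

# F1 at a zero threshold is then effectively the plain F1 score.
```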

4.3 Automatic Evaluation Results

Table 3 shows the results based on the automatic evaluation. We also include results per phenomenon in Appendix F and the drop in performance when accounting for explanation score in Figure 6. Our results yield the following insights:

Fine-tuning on V-FLUTE leads to the best classification performance on average across datasets. Our strongest fine-tuned model (LLaVA-7B-eViL+VF) outperforms the best off-the-shelf model (GPT-4-5shot) in terms of F1@0 (p < 0.03; all p-values are reported via a paired bootstrap test (Koehn, 2004)), and performs competitively when explanation quality is incorporated, with GPT-4 leading slightly (F1@60 of 49.81 vs. 48.80 for the best fine-tuned model). This is expected, as GPT-4 is the teacher model with which the majority of the explanation candidates were generated. Adding the e-ViL dataset improves performance slightly compared to fine-tuning on V-FLUTE alone. Fine-tuning only on e-ViL improves over a random baseline; however, the resulting explanations are of poor quality.
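The paired bootstrap test behind these p-values can be sketched as follows, under the assumption of 1,000 resamples and a generic metric function.

```python
# Sketch of the paired bootstrap test (Koehn, 2004); resample count and metric are assumptions.
import random

def paired_bootstrap_p(metric, gold, preds_a, preds_b, n_resamples=1000, seed=0):
    rng = random.Random(seed)
    n, wins = len(gold), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample test instances with replacement
        g = [gold[i] for i in idx]
        a = [preds_a[i] for i in idx]
        b = [preds_b[i] for i in idx]
        if metric(g, a) > metric(g, b):
            wins += 1
    # p-value: share of resamples in which system A fails to outperform system B.
    return 1.0 - wins / n_resamples
```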

Table 3: F1 score results for different models across explanation-score thresholds 0.0, 0.53, and 0.6. The best result overall is in bold; the best result in each category is underlined.

We also use a hypothesis-only baseline (Poliak et al., 2018) by including a model fine-tuned on the V-FLUTE dataset but without the relevant image (with a white square as input instead, denoted as −Image). Fine-tuning on the full V-FLUTE dataset shows an improvement of over 8 points in F1@0 (better with p < 0.002), suggesting that VLMs benefit from visual information when dealing with figurative phenomena and do not rely solely on the input text to make their predictions.
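The −Image input itself is trivial to construct; a possible version is sketched below, where the 336×336 resolution is an assumption based on the CLIP encoder commonly used by LLaVA rather than a detail reported in the paper.

```python
# Sketch of the −Image ablation input: a blank white square standing in for the V-FLUTE image.
from PIL import Image

blank = Image.new("RGB", (336, 336), color="white")  # resolution is an assumption
blank.save("white_square.jpg")
```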

Open zero-shot instruction-tuned models lag behind API-based models, but scene-graph prompting improves performance. LLaVA-7B and 34B lag behind Claude 3 and GPT-4 in zero-shot settings. However, scene-graph prompting improves the zero-shot performance of the LLaVA-based models, allowing them to catch up to zero-shot API model performance (Gemini and Claude 3). The explanations generated by these models tend to focus heavily on the contents of the scene graph rather than on the underlying figurative phenomena, which likely causes a decrease in explanation score (and consequently in F1@60). The few-shot API models outperform zero-shot API models and beat all configurations of open models in F1@0, F1@53, and F1@60, indicating the effectiveness of few-shot prompting (which is not currently available for LLaVA-based models).

Performance decreases for all models when explanation quality is taken into account. We plot the relative percentage decrease between F1@0 and F1@60 for LLaVA-eViL+VF, LLaVA-34B-SG, and GPT-4-5shot in Figure 6. A higher relative drop indicates greater difficulty in generating the correct explanation. For all models, we see a substantial decrease in performance, especially on challenging phenomena such as Humor (NYCartoons). For the Metaphor (IRFL), Humor (MemeCap), and Idiom (IRFL) subsets, GPT-4 exhibits the lowest relative performance drop, while for Metaphor (HAIVMet), Humor (NYCartoons), and Sarcasm (MuSE) the fine-tuned model has the lowest drop.

The percentage drop is substantially higher for all models on the HAIVMet subset, which contains the metaphor in the image rather than in the text, than on the IRFL subset. This suggests it is harder for models to generate correct explanations when the figurative meaning is conveyed by the image rather than by the text, indicating the need to expand current datasets to include images with figurative meaning.

Figure 6: % drop in F1 score for various models by source dataset between thresholds 0 and 0.6. A higher drop indicates a higher proportion of wrongly generated explanations.

4.4 Human Baseline

To find out how humans perform on the task, we hire two expert annotators with formal education in linguistics. We present them with 10 example instances and then ask them to complete 99 randomly sampled test set instances. We also evaluate our best model (see Table 3) on the same set. Results are shown in Table 4. Human performance is quite strong, almost reaching an F1@0 score of 90 overall. Human performance is better than that of our strongest fine-tuned model (LLaVA-7B-eViL+VF), with p < 0.05 for Annotator 1 and p < 0.07 for Annotator 2. Humans excel at interpreting memes, with both annotators reaching a 100% F1 score. Humans also perform noticeably better on the NYCartoons dataset and on the idiom subset of the task. The model has a slight edge on the sarcasm and visual metaphor subsets, perhaps due to the difficulty of these subsets and potential spurious correlations learned during fine-tuning.

Table 4: Human baseline results (F1@0) by phenomenon and source dataset.
