Researchers Combine GPT-4 and Human Experts to Train AI on Visual Figurative Reasoning | HackerNoon

News Room | Published 18 June 2025

Authors:

(1) Arkadiy Saakyan, Columbia University ([email protected]);

(2) Shreyas Kulkarni, Columbia University;

(3) Tuhin Chakrabarty, Columbia University;

(4) Smaranda Muresan, Columbia University.

Editor’s note: this is part 3 of 6 of a study looking at how well large AI models handle figurative language. Read the rest below.

3 V-FLUTE Task and Dataset

To build V-FLUTE, we start with existing multimodal figurative datasets and use human-AI collaboration frameworks with expert annotators (Chakrabarty et al., 2022; Wiegreffe et al., 2022; Liu et al., 2022) to transform them into a high-quality, explainable visual entailment benchmark. These datasets cover particular phenomena such as metaphors, similes, idioms, sarcasm, and humor. Each instance includes an image and a caption, and the figurative phenomenon can appear in the image, the caption, or both. We transform each dataset into a unified format for explainable visual entailment. An overview of the dataset and our contributions can be found in Table 1; see examples from each dataset in Table 2. Below, we describe the construction of V-FLUTE for each figurative language type (metaphors & similes, idioms, sarcasm, and humor).
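
To make the unified format concrete, the sketch below shows one plausible shape for a single V-FLUTE instance as a Python dataclass. The field names are illustrative assumptions, not the paper's exact schema.

```python
# Hypothetical schema for one unified V-FLUTE instance (field names assumed).
from dataclasses import dataclass
from typing import Literal

@dataclass
class VFluteInstance:
    image_path: str                # premise: the image
    claim: str                     # hypothesis: a textual claim about the image
    label: Literal["entailment", "contradiction"]
    explanation: str               # free-text justification of the label
    phenomenon: str                # "metaphor/simile", "idiom", "sarcasm", or "humor"
    source: str                    # originating dataset, e.g. "HAIVMet" or "IRFL"
```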

3.1 Metaphors and Similes

Metaphors and similes are powerful rhetorical devices that can be expressed either in text or visually in an image. Visual metaphors are used as persuasive devices in various fields such as advertising (Forceville, 2002; Scott, 1994). To create visual entailment instances containing metaphors and similes in V-FLUTE, we rely on two existing resources: HAIVMet (Chakrabarty et al., 2023) and IRFL (Yosef et al., 2023). Instances taken from HAIVMet contain the metaphor/simile as part of the premise (image), while those taken from IRFL have the metaphor/simile as part of the hypothesis (text).

Table 2: Sample dataset instances from V-FLUTE corresponding to the source datasets, displaying images (premise), claims (hypothesis), labels, and explanations [Rows 1-5].

Figure 2: Creation of V-FLUTE instances for metaphors and similes from HAIVMet.

3.1.1 HAIVMet as Data Source

The HAIVMet (Chakrabarty et al., 2023) data consists of 1,193 images of visual metaphors spanning 958 distinct linguistic metaphors. Each image is associated with a claim that the image either entails or contradicts. In addition, each image is associated with a visual elaboration, a textual description of the image (see Figure 2). This visual elaboration was used in the original paper to generate the visual metaphors (images).

Generating Textual Explanations. We augment the dataset with candidate textual explanations by prompting ChatGPT (gpt-3.5-0914) to generate an explanation for every tuple (see Figure 2 and the prompt in Appendix D.1.1).
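
A minimal sketch of this step, assuming the current OpenAI Python SDK; the prompt wording below is an illustrative stand-in for the actual prompt in Appendix D.1.1.

```python
# Sketch of the explanation-generation step. The prompt text is an assumption;
# the paper's actual prompt is in Appendix D.1.1.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_explanation(metaphor: str, elaboration: str, claim: str, label: str) -> str:
    prompt = (
        f"Linguistic metaphor: {metaphor}\n"
        f"Visual elaboration of the image: {elaboration}\n"
        f"Claim: {claim}\n"
        f"Label: {label}\n"
        "Explain why the image entails or contradicts the claim."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the paper cites gpt-3.5-0914
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```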

Expert Verification. Each claim is paired with up to 5 images. However, since these images were automatically generated with DALL-E 2 using the visual elaborations, not all are completely faithful. Moreover, some claims and labels were inconsistent. Finally, automatically generated LLM candidate explanations are not always correct and require refinement. To tackle these issues, we employ an expert verification process involving three expert annotators with significant experience in figurative language and visual metaphor understanding. Since each claim can be paired with more than one visual metaphor, we ask annotators to select the visual metaphor most faithful to the linguistic metaphor and visual elaboration (see Image Selection in Figure 2), or to select none in the rare case when none of the visual metaphors are of good quality. As part of the same annotation round, we also ask them to verify the explanation and edit it if necessary to ensure correctness and high quality. After strict quality control, we have 857 instances.

3.1.2 IRFL as Data Source

Figure 3: Creation of V-FLUTE instances for metaphors, similes, and idioms from IRFL.

The IRFL dataset (Yosef et al., 2023) contains 1,440 figurative expressions, each associated with 4 distinct images. One of those images represents the figurative expression (see Figure 3), and the other 3 act as distractors.

Image Selection. We automatically select images using CLIP (Radford et al., 2021). For each figurative expression, we select the distractor image with the highest CLIPScore (clip-vit-base-patch16) relative to the corresponding entailing image, creating a challenging contradiction instance (see Figure 3, where an unrelated image of a house is discarded when selecting the contradiction instance).
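
One plausible reading of this step is image-image similarity in CLIP embedding space. The sketch below, assuming the Hugging Face transformers implementation of clip-vit-base-patch16, approximates the described selection with cosine similarity between image embeddings.

```python
# Sketch of distractor selection: pick the distractor whose CLIP image embedding
# is closest to the entailing image's, i.e. the hardest negative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def hardest_distractor(entailing_path: str, distractor_paths: list[str]) -> str:
    images = [Image.open(p).convert("RGB") for p in [entailing_path, *distractor_paths]]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
    sims = feats[1:] @ feats[0]                       # cosine similarity to entailing image
    return distractor_paths[int(sims.argmax())]
```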

Generating Textual Explanations. We prompt GPT-4 (gpt-4-vision-preview) with the ground-truth label, the claim, and the image to explain the relationship between the image and the claim.
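
The sketch below shows one way such a call could look with the OpenAI vision API; the prompt wording is an assumption, not the paper's exact prompt.

```python
# Illustrative sketch of prompting gpt-4-vision-preview with the image, claim,
# and gold label. The prompt text is assumed, not taken from the paper.
import base64
from openai import OpenAI

client = OpenAI()

def explain_with_gpt4v(image_path: str, claim: str, label: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Claim: {claim}\nLabel: {label}\n"
                         "Explain the relationship between the image and the claim."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```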

Expert Verification. We recruit the same three expert annotators from the HAIVMet annotations and ask them to verify that each explanation is adequate, editing it when necessary. We also ask the annotators to discard rare noisy instances where the claim, image, and label do not fit. After strict quality control, we are left with 1,149 instances.

3.2 Idioms

The IRFL dataset contains idioms in addition to metaphors and similes. We use a procedure identical to the one described in Section 3.1.2 to generate V-FLUTE instances for idioms (370 examples).

3.3 Sarcasm

To create visual entailment instances containing sarcasm, we rely on the MuSE dataset (Desai et al., 2022). Similar to IRFL, instances from MuSE contain the sarcasm in the hypothesis (text).

3.3.1 MuSE as Data Source

Figure 4: Creation of V-FLUTE instances for sarcasm from MuSE.

The MuSE dataset (Desai et al., 2022) consists of 3,510 distinct images, the respective sarcastic claims that act as contradiction instances (see the example in Figure 4), and crowdworker-written explanations justifying the contradiction.

Generating Entailment Claims. Since the dataset contains only sarcastic instances, there are no claims with an entailment relationship. We generate entailing claims by prompting GPT-4 to produce a non-sarcastic version of each claim while maintaining the user-generated informal style of the text (see the generated entailment claim in Figure 4).

Generating Textual Explanations. While the dataset already contains crowdworker-written explanations, upon inspection they were often of poor quality: formulaic and lacking sufficient detail (e.g., see the crowdworker explanation in Figure 4). To improve their quality, we prompt GPT-4 with the claim, the label, and the existing crowdworker explanation to rewrite it into a high-quality candidate explanation (see the rewritten explanation in Figure 4). See the prompt in Appendix D.3.

Expert Verification. Each image is now paired with a GPT-4-generated entailing claim, an original contradicting claim, and their respective labels and explanations. The same three expert annotators check whether the generated explanations are adequate (i.e., complete, correct, and concise) and edit them if not. Experts are also instructed to discard noisy examples, e.g., when the image does not contradict the sarcastic claim. Through strict quality control, we obtain 1,042 instances.

3.4 Humor

For multimodal humor, we rely on two datasets: MemeCap (Hwang and Shwartz, 2023) and New Yorker cartoons (Hessel et al., 2023).

3.4.1 MemeCap as Data Source

Figure 5: Creation of V-FLUTE instances for humor from MemeCap.

This dataset consists of memes along with their captions that describe the meme poster’s intent (see example in Figure 5). Memes frequently contain implicit, non-literal meaning (Lestari, 2019) and rely on visual metaphors (Piata, 2016), posing a challenge to VLMs.

Claim Generation. Since meme captions are not suited for an entailment task, we prompt GPT-4 with each caption to generate a claim from it (see the example in Figure 5). We further filter this set of samples with GPT-4 by asking whether the image entails the claim, keeping only positive instances (a sketch of this filtering pass follows below). In addition to generating claims that the meme entails, we generate counterclaims using GPT-4.
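
A hedged sketch of that filtering pass, reusing the same vision-API call shape as above; the yes/no prompt is an illustrative assumption, with the paper's actual prompts in Appendix D.4.

```python
# Sketch of the GPT-4 filtering pass: keep a generated claim only when the
# model says the meme image entails it. The prompt wording is assumed.
import base64
from openai import OpenAI

client = OpenAI()

def image_entails(image_path: str, claim: str) -> bool:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Does the image entail this claim? Answer yes or no.\nClaim: {claim}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

# Usage: keep only (image, claim) pairs the model judges as entailed.
# kept = [(img, c) for img, c in candidates if image_entails(img, c)]
```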

Generating Textual Explanations. We prompt GPT-4, including the ground-truth label in the prompt, to explain the relationship between the image and the claim. See the prompts in Appendix D.4.

Expert Verification. We hire the same three expert annotators to ensure the correctness of the data. Each annotator is tasked with verifying that 1) the generated claim fits the image and 2) the explanation is correct and complete, and with making the necessary changes if not. We also ask them to discard samples with inappropriate content. After careful quality control, we have 1,958 instances.

3.4.2 NYCartoons as Data Source

The NYCartoons dataset (Hessel et al., 2023) contains 651 high-quality instances from the New Yorker Cartoon Caption Contest. Each instance consists of a humorous image paired with a caption and a natural language explanation justifying the implicit humor between the caption and the image. We simply use the existing data, treating each caption as a claim that the humorous image entails, paired with its explanation.

3.5 Dataset Statistics

We split our data into 4,578 training, 726 validation, and 723 testing instances. Detailed counts per phenomenon and dataset, as well as other statistics, are in Appendix A.
