AI Still Can’t Explain a Joke—or a Metaphor—Like a Human Can | HackerNoon

Authors:

(1) Arkadiy Saakyan, Columbia University ([email protected]);

(2) Shreyas Kulkarni, Columbia University;

(3) Tuhin Chakrabarty, Columbia University;

(4) Smaranda Muresan, Columbia University.

Editor’s note: this is part 5 of 6 of a study looking at how well large AI models handle figurative language.

5 Human Evaluation and Error Analysis

We conduct a human evaluation of the generated explanations to more reliably assess their quality and to identify key errors in multimodal figurative language understanding. We recruit two expert annotators with a background in linguistics and sample 95 random instances from the test set. For each instance, we first provide the annotators with the image, the claim, and the reference explanation, and ask them to choose the correct label. If the annotator succeeds, they can view the rest of the task, which consists of 3 explanations from our top models by F1@0 in each category: LLaVA-eViL-VF, LLaVA-34B-SG, and GPT-4-5shot. Explanations are taken from both correct and incorrect model predictions. For each explanation, we ask whether it is adequate (accurate, correct, complete, and concise). If not, we ask the annotators to identify the main error type based on the following taxonomy (a schematic of the resulting annotation record is sketched after the list):

• Hallucination: the explanation is not faithful to the image, indicating difficulties with basic visual comprehension (e.g., predicting a blunt tip when the pencil tip is actually sharp; see row 1 of Table 5).

• Unsound reasoning: sentences do not adhere to natural logic or violate common sense (e.g., concluding that an upward arrow and lots of money imply an economic crisis; see row 3).

• Incomplete reasoning: while the explanation makes sense overall, it does not address the key reasons why the image entails or contradicts the claim (for example, it does not address the figurative part of the image; see row 2).

• Too verbose: the explanation is so verbose that it would hinder rather than help one decide the correct label.
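The protocol above maps naturally onto a small annotation record. Below is a minimal sketch of such a record in Python; the class and field names are our own illustration, not part of the released study materials.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ErrorType(Enum):
    HALLUCINATION = "hallucination"            # unfaithful to the image
    UNSOUND_REASONING = "unsound_reasoning"    # violates logic or common sense
    INCOMPLETE_REASONING = "incomplete"        # misses the key (figurative) reason
    TOO_VERBOSE = "too_verbose"                # length hinders deciding the label

@dataclass
class ExplanationJudgment:
    model: str                   # e.g. "GPT-4-5shot"
    adequate: bool               # accurate, correct, complete, and concise
    error: Optional[ErrorType]   # set only when adequate is False

@dataclass
class AnnotationItem:
    instance_id: str
    gold_label: str              # entailment or contradiction
    annotator_label: str         # must match gold before the rest of the task unlocks
    judgments: list[ExplanationJudgment]
```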

5.1 How Do Models Perform According to Humans?

In Table 6, we show adequacy and preference rates for explanations from the 3 systems. An explanation is deemed adequate if both annotators agree that it is, and inadequate if both agree that it is not. The preference percentage is likewise computed over instances where the annotators agree that a model’s explanation is preferred among all the adequate explanations. The average inter-annotator agreement (IAA) using Cohen’s κ is 0.47, indicating moderate agreement (Cohen, 1960). We observe that the teacher model leads in both adequacy and preference rate, as expected from a larger system equipped with stronger reasoning and generation capabilities. Yet only half of its explanations are considered adequate. This further confirms that, despite impressive F1@0 scores, the models are not yet capable of producing adequate textual explanations in many instances.
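For reference, the agreement statistic reported above is the standard Cohen’s κ over paired adequacy judgments; a minimal sketch with scikit-learn, using placeholder labels rather than the study’s actual annotations:

```python
from sklearn.metrics import cohen_kappa_score

# Paired adequacy judgments (1 = adequate, 0 = not adequate) from the two
# annotators over the same explanations -- placeholder values for illustration.
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.41-0.60 is conventionally "moderate" agreement
```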

5.2 What Errors Do Models Make?

We also analyze what types of errors each model makes when its explanations are judged inadequate in the above evaluation. In Figure 7, we show the normalized frequency of error types for cases where both annotators agree that the explanation is not adequate (i.e., out of all errors for a given model, what percentage falls into each error type?). In general, annotators did not consider verbosity to be a major issue for any of the systems. For GPT-4, the leading error type is hallucination, indicating the need to improve faithful image recognition even in the most advanced models. For the fine-tuned model and LLaVA-34B-SG, the main error type is unsound reasoning, indicating that it is challenging for the models to reason consistently about multimodal figurative inputs.

Table 5: Examples of error types in generated explanations.

Figure 7: Normalized frequency of main error types in the explanations, by model.
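The per-model normalization described above can be computed directly from the agreed-upon error annotations; a brief sketch with pandas, with illustrative column names and values:

```python
import pandas as pd

# One row per explanation that both annotators marked inadequate;
# column names and values are illustrative, not from the released data.
errors = pd.DataFrame({
    "model": ["GPT-4-5shot", "GPT-4-5shot", "LLaVA-34B-SG",
              "LLaVA-34B-SG", "LLaVA-eViL-VF", "LLaVA-eViL-VF"],
    "error_type": ["hallucination", "unsound", "unsound",
                   "incomplete", "unsound", "hallucination"],
})

# Within each model, the share of its errors that falls into each type.
normalized = (
    errors.groupby("model")["error_type"]
          .value_counts(normalize=True)
          .rename("share")
          .reset_index()
)
print(normalized)
```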

5.3 How Well Does the Explanation Score Predict Human Judgment on Adequacy?

We explore whether the proposed explanation score can capture human judgment of explanation adequacy. We collect all instances where both annotators agreed on the adequacy judgment for the explanation, and evaluate whether the explanation score described in Section 4.2 can act as a good predictor of the human adequacy judgment. We find that the area under the Precision-Recall curve is 0.79, and the maximum F1 score is 0.77, obtained at an explanation score threshold of 0.53. Hence, we use this threshold to report the results in Table 3. We also use a threshold of 0.6, since it maximizes F1 subject to both precision and recall being above 0.75.
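This kind of threshold selection can be reproduced with standard tooling; a minimal sketch using scikit-learn, where the score and label arrays stand in for the per-explanation scores and the agreed human adequacy judgments:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

# Placeholder data: explanation scores and binary adequacy judgments
# (1 = both annotators agreed adequate, 0 = both agreed inadequate).
scores = np.array([0.91, 0.42, 0.77, 0.55, 0.30, 0.68, 0.85, 0.48])
adequate = np.array([1, 0, 1, 1, 0, 0, 1, 0])

precision, recall, thresholds = precision_recall_curve(adequate, scores)
pr_auc = auc(recall, precision)

# F1 at each candidate threshold; precision/recall have one more entry than thresholds.
f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
best = int(np.argmax(f1))
print(f"PR-AUC = {pr_auc:.2f}, max F1 = {f1[best]:.2f} at threshold {thresholds[best]:.2f}")
```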
