How Reliable Are Human Judgments in AI Model Testing? | HackerNoon

News Room
Published: 19 May 2025 (last updated 7:59 PM)

Table of Links

Abstract and 1 Introduction

2 Pre-Training

2.1 Tokenization

2.2 Pre-Training Data

2.3 Stability

2.4 Inference

3 Alignment and 3.1 Data

3.2 Fine-Tuning Strategy

4 Human Evaluations and Safety Testing, and 4.1 Prompts for Evaluation

4.2 Baselines and Evaluations

4.3 Inter-annotator Agreement

4.4 Safety Testing

4.5 Discussion

5 Benchmark Evaluations and 5.1 Text

5.2 Image-To-Text

6 Related Work

7 Conclusion, Acknowledgements, Contributors, and References

Appendix

A. Samples

B. Additional Information of Human Evaluations

4.3 Inter-annotator Agreement

Every question in our evaluation is answered by three different human annotators, and we take the majority votes as the final answer. To understand the quality of the human annotators and whether the questions we asked are reasonably designed, we examine the level of agreement between different annotators.
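As a concrete illustration of this aggregation step, here is a minimal sketch (not the authors' actual pipeline; the label names are hypothetical) of taking the majority vote over three annotator labels:

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate three annotator labels into a final answer.

    Returns the label chosen by at least two annotators,
    or None when all three annotators disagree (no majority).
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

# Example: two of three annotators say the response "fulfills" the task.
print(majority_vote(["fulfills", "fulfills", "partially fulfills"]))        # -> fulfills
print(majority_vote(["fulfills", "partially fulfills", "does not fulfill"]))  # -> None
```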

Figure 10: The inter-annotator agreement on the questions in the absolute evaluation.

For questions about simple, objective properties of the responses, we very rarely see three annotators disagree with each other. For example, annotators give unanimous judgments on whether the model responses contain objectionable content (e.g., hate speech); in this case, all models produce safe responses. For some questions, such as whether the response fulfills the task or whether the model interprets the prompt correctly, when one annotator’s judgment differs from that of the other two, the disagreement is usually between adjacent labels (e.g., fulfills vs. partially fulfills) rather than opposite ones (e.g., fulfills vs. does not fulfill).[3]

Table 4: The inter-annotator agreement on relative evaluations.

For the relative evaluation, Table 4 shows the number of cases where all three annotators agree, where only two agree, and where there is no agreement. For each model pair, slightly more than 10% of the cases show no agreement among the three annotators (these are treated as ties in our evaluation). In about 28% to 35% of the pairs, all three annotators give the same judgment, and in about 55% to 60% of the pairs, one annotator differs from the other two. This may be interpreted as Chameleon performing comparably to the other baselines in many cases, which makes the relative evaluation challenging.[4]
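To make the three agreement categories concrete, the following is a small sketch (hypothetical data and function names, not the paper's evaluation code) that bins each prompt's three relative judgments into unanimous, two-of-three, and no-agreement buckets:

```python
from collections import Counter

def agreement_breakdown(votes_per_prompt):
    """Classify each prompt's three relative judgments
    (e.g. "A wins", "B wins", "tie") into agreement buckets
    and return the fraction of prompts in each bucket."""
    buckets = Counter()
    for votes in votes_per_prompt:
        top_count = Counter(votes).most_common(1)[0][1]
        if top_count == 3:
            buckets["all three agree"] += 1
        elif top_count == 2:
            buckets["two agree"] += 1
        else:
            buckets["no agreement"] += 1  # treated as a tie in the evaluation
    total = len(votes_per_prompt)
    return {bucket: count / total for bucket, count in buckets.items()}

# Toy data: three prompts, three annotators each.
print(agreement_breakdown([
    ["A wins", "A wins", "A wins"],   # unanimous
    ["A wins", "tie", "A wins"],      # two agree
    ["A wins", "B wins", "tie"],      # no agreement
]))
```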

4.4 Safety Testing

We crowdsource prompts that provoke the model to create unsafe content in predefined categories such as self-harm, violence and hate, and criminal planning. These prompts cover both text and mixed-modal inputs, as well as intents to produce unsafe text, images, or mixed-modal outputs. We generate the model’s response to each prompt, and ask annotators to label whether the response is safe or unsafe with respect to each category’s definition of safety; an unsure option is also provided for borderline responses. Table 5 shows that an overwhelming majority of Chameleon’s responses are considered safe, with only 78 (0.39%) unsafe responses for the 7B model and 19 (0.095%) for the 30B model.

We also evaluate the model’s ability to withstand adversarial prompting in an interactive session. For that purpose, an internal red team probed the 30B model over 445 prompt-response interactions, including multi-turn interactions. Table 5 shows that of those responses, 7 (1.6%) were considered unsafe and 20 (4.5%) were labeled as unsure. While additional safety tuning with RLHF/RLAIF has been shown to further harden models against jailbreaking and intentional malicious attacks, these results demonstrate that our current safety tuning approach provides significant protection for reasonable, benign usage of this research artifact.
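As a quick sanity check of the reported rates (assuming the 20,000 crowdsourced prompts from Table 5 as the denominator for the first two figures), the percentages follow directly from the counts:

```python
# Unsafe/unsure response rates implied by the counts reported in Table 5.
print(f"7B,  crowdsourced unsafe: {78 / 20_000:.2%}")   # ~0.39%
print(f"30B, crowdsourced unsafe: {19 / 20_000:.3%}")   # ~0.095%
print(f"30B, red team unsafe:     {7 / 445:.1%}")       # ~1.6%
print(f"30B, red team unsure:     {20 / 445:.1%}")      # ~4.5%
```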

4.5 Discussion

Compared to Gemini and GPT-4V, Chameleon is very competitive when handling prompts that expect interleaved, mixed-modal responses. The images generated by Chameleon are usually relevant to the context, making documents with interleaved text and images very appealing to users. However, readers should be aware of the limitations of human evaluation. First, the prompts used in the evaluation come from crowdsourcing rather than from real users interacting with a model. While the set of prompts is certainly diverse, its coverage may still be limited, given the size of the dataset. Second, partially because our prompts focus on mixed-modal output, certain visual understanding tasks, such as OCR or infographics (i.e., interpreting a given chart or plot), are naturally excluded from our evaluation. Finally, at this moment, the APIs of existing multi-modal LLMs provide only textual responses. While we strengthen the baselines by augmenting their output with separately generated images, a direct comparison between Chameleon and other natively mixed-modal models would still be preferable.

Table 5: Safety testing on 20,000 crowdsourced prompts and 445 red team interactions provoking the model to produce unsafe content.

Author:

(1) Chameleon Team, FAIR at Meta.


[3] For the question of task fulfillment, the inter-rater reliability derived by Krippendorff’s Alpha (Krippendorff, 2018; Marzi et al., 2024) is 0.338; the 95% confidence interval is [0.319, 0.356], based on bootstrap sampling of 1,000 iterations.

[4] When comparing Chameleon with Gemini+ and GPT-4V+, the Krippendorff’s Alpha values are 0.337 [0.293, 0.378] and 0.396 [0.353, 0.435], respectively.
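For readers who want to reproduce this kind of reliability estimate, below is a minimal sketch using the open-source krippendorff Python package; the data, label coding, and bootstrap procedure shown are assumptions for illustration, not the authors' code. It computes nominal Krippendorff's Alpha over an annotators-by-items matrix and bootstraps items to obtain a 95% confidence interval:

```python
import numpy as np
import krippendorff  # pip install krippendorff

rng = np.random.default_rng(0)

# Hypothetical ratings: rows = 3 annotators, columns = prompts.
# Labels coded as integers, e.g. 0 = does not fulfill, 1 = partially fulfills, 2 = fulfills.
ratings = rng.integers(0, 3, size=(3, 200)).astype(float)

def alpha_with_ci(data, n_boot=1000, level="nominal"):
    """Point estimate of Krippendorff's Alpha plus a bootstrap 95% CI,
    obtained by resampling items (columns) with replacement."""
    point = krippendorff.alpha(reliability_data=data, level_of_measurement=level)
    n_items = data.shape[1]
    boots = []
    for _ in range(n_boot):
        cols = rng.integers(0, n_items, size=n_items)
        boots.append(krippendorff.alpha(reliability_data=data[:, cols],
                                        level_of_measurement=level))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (lo, hi)

alpha, (lo, hi) = alpha_with_ci(ratings)
print(f"Krippendorff's Alpha = {alpha:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```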
