Can AI-Generated Context Improve the Quality of Crowdsourced Feedback?


Authors:

(1) Clemencia Siro, University of Amsterdam, Amsterdam, The Netherlands;

(2) Mohammad Aliannejadi, University of Amsterdam, Amsterdam, The Netherlands;

(3) Maarten de Rijke, University of Amsterdam, Amsterdam, The Netherlands.

Table of Links

Abstract and 1 Introduction

2 Methodology and 2.1 Experimental data and tasks

2.2 Automatic generation of diverse dialogue contexts

2.3 Crowdsource experiments

2.4 Experimental conditions

2.5 Participants

3 Results and Analysis and 3.1 Data statistics

3.2 RQ1: Effect of varying amount of dialogue context

3.3 RQ2: Effect of automatically generated dialogue context

4 Discussion and Implications

5 Related Work

6 Conclusion, Limitations, and Ethical Considerations

7 Acknowledgements and References

A. Appendix

3.3 RQ2: Effect of automatically generated dialogue context

Label quality. In Phase 2, our experiments aim to establish the impact of presenting annotators with different types of context during crowdsourcing. Instead of the conventional dialogue context, we provide annotators with a dialogue summary (C0-sum) or the user's information need in the dialogue (C0-heu and C0-llm). We also aim to uncover whether we can improve the quality of the crowdsourced labels in C0 to match those in C7. We calculate Cohen's Kappa as in Section 3.2; see Table 2.
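
The paper does not include its analysis code; as a minimal sketch, agreement statistics of the kind reported in Table 2 (Cohen's Kappa and Kendall's Tau between two annotators' labels under one condition) can be computed with sklearn and scipy. The variable names and label values below are illustrative, not taken from the study's data:

```python
# Minimal sketch: Cohen's Kappa and Kendall's Tau between two annotators'
# labels for the same set of system responses under one condition.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kendalltau

# Hypothetical binary relevance labels from two annotators.
annotator_a = [1, 0, 1, 1, 0, 1, 0, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
tau, p_value = kendalltau(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}, Kendall's Tau: {tau:.2f} (p={p_value:.3f})")
```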

The heuristic approach (C0-heu) yields the highest agreement (Kappa and Tau), indicating a noteworthy degree of agreement in relevance assessments. The LLM-generated contexts (C0-llm and C0-sum) result in a moderate to substantial level of agreement on the relevance of the system response. We observe similar results for usefulness. The heuristic approach (C0-heu) again leads with the highest level of agreement (0.71 and 0.59); C0-sum follows with a Kappa score of 0.63, while C0-llm has a Kappa score of 0.53. This high level of agreement (Kappa) for the two aspects indicates the quality of the labels: the additional context, generated either heuristically or with LLMs, is effective in conveying relevant information to annotators, leading to more consistent assessments.

For both relevance and usefulness, C0-heu consistently improves agreement among annotators, while the LLM-generated contexts (C0-llm and C0-sum) show substantially lower agreement than C7. This difference reflects the limitations of LLMs in capturing context and generating a factual summary. While they generate coherent text, LLMs sometimes fail to correctly represent the sequential order of the dialogue and users' language patterns.

Label consistency across conditions. In Figure 4a we report the agreement between the setups in Phase 2 and compare them to C7 (relevance) and C3 (usefulness) due to their high inter-annotator agreement (IAA) and label consistency. For the relevance annotations, varying levels of agreement emerge. There is substantial agreement between C0-heu and C0-llm (59.36%), showing a significant overlap in the labels assigned using both methods, although there are instances where annotators differ in their assessments of relevance. C0-sum exhibits moderate label agreement with C0-llm (62.74%) and C0-heu (65.67%), pointing to relatively similar label assignments across the setups.
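
The paper does not spell out how these cross-condition percentages are aggregated; as an illustrative sketch, one common approach is to reduce each item's annotator labels to a majority vote per condition and report the fraction of items on which two conditions match. All names and label values below are hypothetical:

```python
# Illustrative sketch: percentage label agreement between two experimental
# conditions. Majority-vote aggregation per item is an assumption made for
# illustration, not the paper's documented procedure.
from collections import Counter

def majority_label(labels):
    """Return the most frequent label among one item's annotator labels."""
    return Counter(labels).most_common(1)[0][0]

def label_agreement(cond_x, cond_y):
    """Fraction of items whose aggregated labels match across conditions."""
    matches = sum(majority_label(x) == majority_label(y)
                  for x, y in zip(cond_x, cond_y))
    return matches / len(cond_x)

# Hypothetical per-item labels from three annotators under two conditions.
c0_heu = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1]]
c0_llm = [[1, 1, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]
print(f"Label agreement: {label_agreement(c0_heu, c0_llm):.2%}")  # 75.00%
```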

We observe similar results for usefulness in Figure 4b. While the heuristic approach achieves high IAA, the C0-sum method demonstrates greater consistency with all other setups in terms of usefulness. This suggests that while annotators using the C0-heu approach often agreed on a single label, the chosen label may not always have been the most accurate. We note slightly lower agreement on identical labels across the three setups, consistent with the results in Phase 1. Unlike relevance, which used a binary scale, usefulness was rated on a 1–3 scale. This finer-grained scale may explain the lower agreement compared to relevance, as different types of contextual information can influence usefulness scores.
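
For ordinal scales like the 1–3 usefulness ratings, a weighted variant of Cohen's Kappa penalizes near-misses (e.g., 2 vs. 3) less than full disagreements (1 vs. 3); the paper does not state whether it applies weighting, so the sketch below is one standard option rather than the study's method:

```python
# Sketch: linearly weighted Cohen's Kappa for ordinal 1-3 usefulness ratings.
# Whether the paper uses weighted Kappa is not stated; this is illustrative.
from sklearn.metrics import cohen_kappa_score

# Hypothetical usefulness ratings from two annotators.
usefulness_a = [1, 2, 3, 2, 1, 3, 2, 2]
usefulness_b = [1, 3, 3, 2, 2, 3, 1, 2]

unweighted = cohen_kappa_score(usefulness_a, usefulness_b)
weighted = cohen_kappa_score(usefulness_a, usefulness_b, weights="linear")
print(f"Unweighted: {unweighted:.2f}, linearly weighted: {weighted:.2f}")
```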

Regarding RQ2, we show that we can improve the consistency of the labels assigned by crowdworkers in the C0 condition by augmenting the current turn with automatically generated supplementary dialogue context. The heuristic approach demonstrates higher IAA and label consistency for both relevance and usefulness compared to C0 and C7. Providing annotators with the user's initial utterance expressing their preference, particularly in scenarios lacking context, can significantly enhance the quality and consistency of crowdsourced labels. This approach can yield performance comparable to a setup involving the entire dialogue (C7) without imposing the cognitive load of reading an entire conversation on annotators. It streamlines the annotation process while maintaining high-quality results, offering a practical strategy for obtaining reliable labels for dialogue evaluation.
