Can LLMs Improve Crowdsourced Evaluation in Dialogue Systems?

Published: 7 April 2025

Authors:

(1) Clemencia Siro, University of Amsterdam, Amsterdam, The Netherlands;

(2) Mohammad Aliannejadi, University of Amsterdam, Amsterdam, The Netherlands;

(3) Maarten de Rijke, University of Amsterdam, Amsterdam, The Netherlands.

Table of Links

Abstract and 1 Introduction

2 Methodology and 2.1 Experimental data and tasks

2.2 Automatic generation of diverse dialogue contexts

2.3 Crowdsource experiments

2.4 Experimental conditions

2.5 Participants

3 Results and Analysis and 3.1 Data statistics

3.2 RQ1: Effect of varying amount of dialogue context

3.3 RQ2: Effect of automatically generated dialogue context

4 Discussion and Implications

5 Related Work

6 Conclusion, Limitations, and Ethical Considerations

7 Acknowledgements and References

A. Appendix

2 Methodology

We examine how contextual information about a dialogue affects the consistency of crowdsourced judgments regarding the relevance and usefulness of a dialogue response. Here, contextual information refers to the information or conversation that precedes a specific response. We carry out experiments in two phases. Phase 1 varies the amount of dialogue context shown to annotators to answer RQ1. Phase 2 varies the type of prior contextual information available to annotators to address RQ2.

2.1 Experimental data and tasks

We use the recommendation dialogue (ReDial) dataset (Li et al., 2018), a conversational movie recommendation dataset comprising over 11K dialogues. The dataset was collected using a human-human approach: one person acts as the movie seeker, while the other acts as the recommender whose goal is to recommend a suitable movie to the seeker, making the dataset goal-oriented. We randomly select system responses from 40 dialogues for the assignment of relevance and usefulness labels. These dialogues typically consist of 10 to 11 utterances each, with an average utterance length of 14 words. We evaluate the same system responses across all experimental conditions.
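As a rough illustration of this sampling step, here is a minimal sketch. It assumes ReDial's JSONL release, where each line is a dialogue object with a "messages" list of utterance dicts; the file name and the fixed seed are ours:

```python
import json
import random

# Load ReDial dialogues (assumed JSONL layout: one dialogue per line,
# each with a "messages" list of utterance dicts containing "text").
with open("train_data.jsonl") as f:
    dialogues = [json.loads(line) for line in f]

random.seed(42)  # fixed seed purely so the sketch is reproducible
sampled = random.sample(dialogues, 40)  # 40 dialogues, as described above

# Sanity-check the context statistics reported above.
utterances = [m["text"] for d in sampled for m in d["messages"]]
avg_words = sum(len(u.split()) for u in utterances) / len(utterances)
avg_turns = len(utterances) / len(sampled)
print(f"avg utterance length: {avg_words:.1f} words; avg turns: {avg_turns:.1f}")
```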

The annotation task involves two dimensions: (i) relevance: is the system response relevant to the user's request, considering the context of the dialogue? And (ii) usefulness: how useful is the system's response given the user's information need?

For the relevance task, we ask annotators to judge how relevant the system's recommendations are to the user's request (Alonso et al., 2008). First, the annotator judges whether the system response includes a movie recommendation; if it does, the annotator assesses whether the movie meets the user's preference; if not, we ask them to note that the utterance does not recommend a movie. When the response does include a recommendation, the judgment is on a binary scale: the movie is either relevant (1) or not (0). For each experimental condition (see below), annotators only assess the system response with access to the previous context. Note that we withhold the user's feedback on the evaluated response (the next user utterance) so as to focus on the topical relevance of the recommended movie, that is, whether the movie meets the user's request and preference in terms of genre, actor, director, etc.

For the usefulness task, annotators assess a response with or without a movie recommendation, with the aim of determining how useful the system's response is to the user (Mao et al., 2016). The judgment is on a three-point scale (very, somewhat, and not useful). Unlike in the relevance task, annotators have access to the user's next utterance for the usefulness task; usefulness is personalized to the user: even if a recommended movie is in the requested genre, the user may still dislike it (e.g., because of the main actor), making the system response relevant but not useful.
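To make the two label spaces concrete, here is a minimal sketch of an annotation record implied by the task description above; the field and type names are ours, not the paper's:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Usefulness(Enum):
    NOT_USEFUL = 0       # three-point scale for the usefulness task
    SOMEWHAT_USEFUL = 1
    VERY_USEFUL = 2

@dataclass
class TurnAnnotation:
    dialogue_id: str
    turn_id: int
    has_recommendation: bool                  # step 1 of the relevance task
    relevant: Optional[int] = None            # binary 1/0, only if a movie is recommended
    usefulness: Optional[Usefulness] = None   # judged with the next user utterance visible

# Example: a turn that recommends a movie matching the requested genre,
# which the user nevertheless only finds somewhat useful.
ann = TurnAnnotation("redial_0042", 5, has_recommendation=True,
                     relevant=1, usefulness=Usefulness.SOMEWHAT_USEFUL)
```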

2.2 Automatic generation of diverse dialogue contexts

User information need. The user's information need plays a significant role when assessing or improving the quality of the data collected in IR systems (Mao et al., 2016). It refers to the specific requirement or query expressed by a user, which guides the system in understanding their preferences and retrieving relevant information to fulfill that need. For task-oriented dialogue systems (TDSs), understanding the user's intent is crucial for annotators participating in the evaluation, as they are not the actual end users; this understanding improves the alignment of evaluation labels with the actual user's requirements. We define the user's information need as their movie recommendation preference. Given the consistency of user preferences in the ReDial dataset, where users tend to maintain a single preference throughout a conversation, providing the user's initial information need aids annotators in evaluating the current turn for relevance or usefulness.

We adopt two approaches to generate the user's information need. The first is to heuristically extract the first user utterance that either requests a movie recommendation or expresses a movie preference, based on phrases such as "looking for," "recommend me," and "prefer." These phrases are mined from the first three user utterances of each dialogue, and the top 10 most common phrases are selected. The second approach relies on LLMs to generate the user's information need. We hypothesize that LLMs can identify pertinent user utterances in a dialogue and generate the corresponding information need. We use GPT-4 (OpenAI, 2023) in a zero-shot setting: with the dialogue context up to the current turn as input, we prompt the model to generate the user's information need.
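A minimal sketch of both approaches follows. The trigger-phrase list is an illustrative subset (the paper's full top-10 list is not reproduced in this section), and the prompt wording and helper names are our assumptions:

```python
from openai import OpenAI

# Illustrative subset of trigger phrases; the paper selects the 10 most
# common such phrases, which are not listed in full in this section.
TRIGGER_PHRASES = ["looking for", "recommend me", "prefer"]

def heuristic_information_need(user_utterances: list[str]) -> str | None:
    """Return the first of the opening three user utterances that
    requests a recommendation or expresses a preference."""
    for utt in user_utterances[:3]:
        if any(phrase in utt.lower() for phrase in TRIGGER_PHRASES):
            return utt
    return None

def llm_information_need(dialogue_context: str) -> str:
    """Zero-shot GPT-4 generation; the exact prompt is our assumption."""
    client = OpenAI()
    prompt = (
        "Given the dialogue below, state the user's movie recommendation "
        "preference (their information need) in one sentence.\n\n"
        f"Dialogue:\n{dialogue_context}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```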

Generating dialogue summaries. Dialogue summarization provides quick context to new participants in a conversation and helps people grasp the main ideas or search for key content after the conversation, which can increase efficiency and productivity (Feng et al., 2022). We use dialogue summaries to give annotators quick prior context for a dialogue. We again use GPT-4 (OpenAI, 2023) in a zero-shot setting, as for the user information needs, but with a different prompt: we instruct GPT-4 to generate a summary that is both concise and informative, constituting less than half the length of the input dialogue. Both the generated user information needs and the summaries are incorporated in Phase 2 of the crowdsourcing experiments.
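A companion sketch of the summarization call, under the same assumptions; only the sub-half-length constraint comes from the description above, while the prompt phrasing is illustrative:

```python
from openai import OpenAI

def summarize_dialogue(dialogue_context: str) -> str:
    """Zero-shot GPT-4 dialogue summary; the length constraint mirrors
    the instruction described above, the wording is our assumption."""
    client = OpenAI()
    max_words = len(dialogue_context.split()) // 2  # < half the input length
    prompt = (
        "Summarize the dialogue below. Be concise and informative, and "
        f"use fewer than {max_words} words.\n\nDialogue:\n{dialogue_context}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```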

Due to LLMs' potential for hallucination (Bouyamourn, 2023; Chang et al., 2023), we evaluate the generated summaries and user information needs to ensure factuality and coherence. We elaborate on the steps we took in Section A.2.
