Authors:
(1) Clemencia Siro, University of Amsterdam, Amsterdam, The Netherlands;
(2) Mohammad Aliannejadi, University of Amsterdam, Amsterdam, The Netherlands;
(3) Maarten de Rijke, University of Amsterdam, Amsterdam, The Netherlands.
Table of Links
Abstract and 1 Introduction
2 Methodology and 2.1 Experimental data and tasks
2.2 Automatic generation of diverse dialogue contexts
2.3 Crowdsource experiments
2.4 Experimental conditions
2.5 Participants
3 Results and Analysis and 3.1 Data statistics
3.2 RQ1: Effect of varying amount of dialogue context
3.3 RQ2: Effect of automatically generated dialogue context
4 Discussion and Implications
5 Related Work
6 Conclusion, Limitations, and Ethical Considerations
7 Acknowledgements and References
A. Appendix
2 Methodology
We examine how contextual information about a dialogue affects the consistency of crowdsourced judgments of the relevance and usefulness of a dialogue response. Here, contextual information refers to the information or conversation that precedes a specific response. We carry out experiments in two phases. In Phase 1, we vary the amount of dialogue context available to annotators to answer RQ1. In Phase 2, we vary the type of previous contextual information available to annotators to address RQ2.
2.1 Experimental data and tasks
We use the recommendation dialogue (ReDial) dataset (Li et al., 2018), a conversational movie recommendation dataset comprising over 11K dialogues. The dataset was collected with a human-human approach: one person acts as the movie seeker, while the other acts as the recommender whose goal is to recommend a suitable movie to the seeker, making the dataset goal-oriented. We randomly select system responses from 40 dialogues for the assignment of relevance and usefulness labels. These dialogues typically consist of 10 to 11 utterances each, with an average utterance length of 14 words. We evaluate the same system responses across all experimental conditions.
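For concreteness, the following is a minimal sketch of the sampling step, assuming the publicly released JSONL format of ReDial (one dialogue per line with a `messages` field); the selection logic is illustrative, not the authors' actual script.

```python
# Sketch of sampling dialogues from a ReDial-style JSONL file (assumed format).
import json
import random

def sample_dialogues(path: str, n: int = 40, seed: int = 42) -> list[dict]:
    """Randomly pick n dialogues whose system responses will be annotated."""
    with open(path, encoding="utf-8") as f:
        dialogues = [json.loads(line) for line in f]
    random.seed(seed)  # fixed seed so the same responses are reused across conditions
    return random.sample(dialogues, n)
```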
The annotation task involves two dimensions: (i) relevance: Is the system response relevant to the user's request, considering the context of the dialogue? And (ii) usefulness: How useful is the system's response given the user's information need?

For the relevance task, we ask annotators to judge how relevant the system's recommendations are to the user's request (Alonso et al., 2008). The annotator first judges whether the system response includes a movie recommendation. If it does, the annotator assesses whether the movie meets the user's preference on a binary scale, where the movie is either relevant (1) or not (0); if it does not, we ask them to note that the utterance does not recommend a movie. For each experimental condition (see below), annotators assess the system response only with access to the previous context. Note that we withhold the user's feedback on the evaluated response (the next user utterance) so as to focus on the topical relevance of the recommended movie, that is, whether the movie meets the user's request and preference in terms of genre, actor, director, etc.

For the usefulness task, annotators assess a response with or without a movie recommendation, with the aim of determining how useful the system's response is to the user (Mao et al., 2016). The judgment is made on a three-point scale (very, somewhat, and not useful). Unlike the relevance task, annotators have access to the user's next utterance for the usefulness task; usefulness is personalized to the user: even though a movie may be in the preferred genre, the user may still not like it (e.g., they dislike the main actor), making the system response relevant but not useful to the user.
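To make the two scales concrete, the sketch below shows label options an annotation interface could expose; the option names and numeric codes are our own illustration of the scales described above, not the paper's interface.

```python
# Illustrative label schemes for the two annotation tasks (names/codes assumed).
RELEVANCE_OPTIONS = {
    1: "relevant to the user's preference",
    0: "not relevant to the user's preference",
    "no_rec": "the utterance does not recommend a movie",
}

USEFULNESS_OPTIONS = {
    2: "very useful",
    1: "somewhat useful",
    0: "not useful",
}
```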
2.2 Automatic generation of diverse dialogue contexts
User information need. The user’s information need plays a significant role when assessing or improving the quality of the data collected in IR systems (Mao et al., 2016). It refers to the specific requirement or query made by a user, which guides the system in understanding their preferences and retrieving relevant information to fulfill that need. For TDSs, understanding the user’s intent is crucial for annotators participating in the evaluation, as they are not the actual end users. This understanding improves the alignment of evaluation labels with the actual user’s requirements. We define the user’s information need as their movie recommendation preference. Given the consistency of user preferences in the ReDial dataset, where users tend to maintain a single preference throughout a conversation, providing the user’s initial information need aids annotators in evaluating the current turn for relevance or usefulness.
We adopt two approaches to generate the user's information need. The first heuristically extracts the first user utterance that either requests a movie recommendation or expresses a movie preference, matching against cue phrases such as "looking for," "recommend me," and "prefer"; these phrases are mined from the first three user utterances of each dialogue, and the 10 most common phrases are selected. The second approach relies on LLMs to generate the user's information need. We hypothesize that LLMs can identify pertinent user utterances in a dialogue and generate the corresponding information need. We use GPT-4 (OpenAI, 2023) in a zero-shot setting: with the dialogue context up to the current turn as input, we prompt the model to generate the user's information need.
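A minimal sketch of the two generators follows, assuming access to GPT-4 through the OpenAI chat API; the cue-phrase set and the prompt wording are our own illustrations of the approach described above, not the exact ones used in the study.

```python
# Sketch of the heuristic and LLM-based information-need generators (assumptions noted above).
from openai import OpenAI

CUE_PHRASES = ("looking for", "recommend me", "prefer")  # example cue phrases only

def heuristic_information_need(user_utterances: list[str]) -> str | None:
    """Return the earliest of the first three user utterances that states a preference."""
    for utterance in user_utterances[:3]:
        if any(phrase in utterance.lower() for phrase in CUE_PHRASES):
            return utterance
    return None

def llm_information_need(dialogue_context: str, client: OpenAI) -> str:
    """Zero-shot GPT-4 call that states the user's movie recommendation need."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Identify the user's movie recommendation need from the "
                        "dialogue and state it in one sentence."},
            {"role": "user", "content": dialogue_context},
        ],
    )
    return response.choices[0].message.content
```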
Generating dialogue summaries. Dialogue summarization provides quick context to new participants in a conversation and helps people understand the main ideas or search for key content after the conversation, which can increase efficiency and productivity (Feng et al., 2022). We use dialogue summaries to give annotators quick prior context for a dialogue. As with the user information needs, we use GPT-4 (OpenAI, 2023) in a zero-shot setting, but with a different prompt: we instruct GPT-4 to generate a summary that is both concise and informative, constituting less than half the length of the input dialogue. Both the generated user information needs and summaries are incorporated in Phase 2 of the crowdsourcing experiments.
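A minimal sketch of the summarization step, again assuming the OpenAI chat API; the prompt text and the way the half-length constraint is imposed are assumptions based on the description above.

```python
# Sketch of zero-shot dialogue summarization with a length constraint (assumed prompt).
from openai import OpenAI

def summarize_dialogue(dialogue: str, client: OpenAI) -> str:
    """Ask GPT-4 for a concise, informative summary under half the dialogue length."""
    max_words = max(1, len(dialogue.split()) // 2)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": ("Summarize the following dialogue concisely and informatively "
                         f"in fewer than {max_words} words.")},
            {"role": "user", "content": dialogue},
        ],
    )
    return response.choices[0].message.content
```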
Due to LLMs’ potential for hallucination (Bouyamourn, 2023; Chang et al., 2023), we evaluate the generated summaries and user information needs to ensure factuality and coherence. We elaborate on the steps we took in Section A.2.