Evaluating TnT-LLM: Automatic, Human, and LLM-Based Assessment | HackerNoon

News Room | Published 19 April 2025

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Method and 3.1 Phase 1: Taxonomy Generation

3.2 Phase 2: LLM-Augmented Text Classification

4 Evaluation Suite and 4.1 Phase 1 Evaluation Strategies

4.2 Phase 2 Evaluation Strategies

5 Experiments and 5.1 Data

5.2 Taxonomy Generation

5.3 LLM-Augmented Text Classification

5.4 Summary of Findings and Suggestions

6 Discussion and Future Work, and References

A. Taxonomies

B. Additional Results

C. Implementation Details

D. Prompt Templates

4 EVALUATION SUITE

Due to the unsupervised nature of the problem we study and the lack of a benchmark standard, performing quantitative evaluation on end-to-end taxonomy generation and text classification is challenging. We therefore design a suite of strategies to evaluate TnT-LLM, which fall into three categories depending on the type and source of the evaluation criteria:

• Deterministic automatic evaluation: This type of approach is scalable and consistent, but requires well-defined, gold standard rules and annotations. It is less applicable for evaluating the abstract aspects studied in this paper, such as the quality and usefulness of a label taxonomy.

• Human evaluation: These approaches are useful for evaluating the abstract aspects that the automatic evaluations cannot address. However, they are also time-consuming, expensive, and may encounter data privacy and compliance constraints.

• LLM-based evaluations: Here, LLMs are used to perform the same or similar tasks as human evaluators. This type of evaluation is more scalable and cost-effective than human evaluation, albeit potentially subject to biases and errors if not applied properly. We therefore aim to combine and validate LLM-based evaluation with human evaluation metrics on small corpora so that we can extrapolate conclusions with sufficient statistical power.
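
As a concrete illustration of this validation step, the sketch below compares human and LLM ratings collected on the same small sample and reports raw agreement alongside Cohen's kappa. The helper name and the choice of kappa are our own assumptions for illustration; the excerpt does not prescribe a specific agreement statistic.

```python
from collections import Counter

def cohens_kappa(human: list[str], llm: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items.

    Hypothetical helper for validating LLM-based ratings against human
    ratings on a small annotated sample; not the paper's own tooling.
    """
    assert human and len(human) == len(llm)
    n = len(human)
    observed = sum(h == m for h, m in zip(human, llm)) / n
    # Expected agreement if the two raters labeled independently.
    h_counts, l_counts = Counter(human), Counter(llm)
    expected = sum((h_counts[c] / n) * (l_counts[c] / n)
                   for c in set(h_counts) | set(l_counts))
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)

# Example: verdicts from a human rater and an LLM rater on six items.
human_votes = ["A", "A", "B", "A", "None", "B"]
llm_votes = ["A", "B", "B", "A", "A", "B"]
raw = sum(h == m for h, m in zip(human_votes, llm_votes)) / len(human_votes)
print(f"raw agreement: {raw:.2f}, Cohen's kappa: {cohens_kappa(human_votes, llm_votes):.2f}")
```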

4.1 Phase 1 Evaluation Strategies

Following prior studies [23, 30], we evaluate a label taxonomy on three criteria: coverage, accuracy, and relevance to the use-case instruction. Applying these metrics requires each method's native primary label assignment: for clustering-based methods, this is instantiated through the clustering algorithm; for TnT-LLM, it is produced by the label assignment prompt described in Section 3.2. We also note that the label accuracy and use-case relevance metrics discussed here are applicable to both human and LLM raters.

Taxonomy Coverage. This metric measures the comprehensiveness of the generated label taxonomy for the corpus. Conventional text clustering approaches (e.g., embedding-based k-means) often achieve 100% coverage by design. In our LLM-based taxonomy generation pipeline, we add an ‘Other’ or ‘Undefined’ category to the label assignment prompt and measure the proportion of data points assigned to it; the lower this proportion, the higher the taxonomy coverage.
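
As a minimal sketch of this computation, the snippet below assumes each data point has already been assigned a primary label by the assignment step; the catch-all label names are taken from the description above, and the function name is illustrative.

```python
def taxonomy_coverage(assigned_labels: list[str],
                      catch_all: tuple[str, ...] = ("Other", "Undefined")) -> float:
    """Fraction of data points covered by a named (non-catch-all) label.

    Coverage is one minus the proportion of points routed to the
    'Other' / 'Undefined' category in the label assignment prompt.
    """
    if not assigned_labels:
        return 0.0
    uncovered = sum(label in catch_all for label in assigned_labels)
    return 1.0 - uncovered / len(assigned_labels)

labels = ["Content Creation", "Other", "Coding Help", "Undefined", "Travel Planning"]
print(f"coverage: {taxonomy_coverage(labels):.2f}")  # coverage: 0.60
```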

Label Accuracy. This metric quantifies how well the assigned label reflects the text data point, relative to other labels in the same taxonomy. Analogous to mixture model clustering, the primary label should be the most probable one given the text. We assume human and LLM raters can assess a label's fit from its name and description. We treat accuracy as a pairwise comparison task: for each text, we take its primary label and a random negative label from the same taxonomy, and ask a rater to choose the more accurate label based on their names and descriptions.[1] If the rater correctly identifies the positive label, we count it as a “Hit” and report the average hit rate as the label accuracy metric. We do not explicitly evaluate the overlap across category labels, but rather expect it to be implicitly reflected in the pairwise label accuracy metric.
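
The sketch below illustrates this pairwise protocol under the assumption that a rater (human or LLM) is exposed as a simple callable that receives the text and two candidate labels and returns its choice; the interface and the negative-sampling details are illustrative rather than the paper's implementation.

```python
import random
from typing import Callable

def pairwise_label_accuracy(
    texts: list[str],
    primary_labels: list[str],
    taxonomy: list[str],
    rate: Callable[[str, str, str], str],  # rater: (text, option_1, option_2) -> chosen label or "None"
    seed: int = 0,
) -> float:
    """Average hit rate of a rater picking the primary label over a random negative."""
    rng = random.Random(seed)
    hits = 0
    for text, positive in zip(texts, primary_labels):
        negative = rng.choice([label for label in taxonomy if label != positive])
        # Shuffle the pair so the positive label is not always presented first.
        first, second = (positive, negative) if rng.random() < 0.5 else (negative, positive)
        hits += int(rate(text, first, second) == positive)
    return hits / len(texts)

# Toy usage with a stand-in rater that always picks the first option.
taxonomy = ["Content Creation", "Coding Help", "Travel Planning"]
texts = ["Draft a blog post about hiking", "Fix this Python import error"]
primary = ["Content Creation", "Coding Help"]
print(pairwise_label_accuracy(texts, primary, taxonomy, lambda t, a, b: a))
```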

Relevance to Use-case Instruction. This metric measures how relevant the generated label taxonomy is to the use-case instruction. For example, “Content Creation” is relevant to an instruction to “understand user intent in a conversation”, while “History and Culture” is not. We operationalize this as a binary rating task: for each instance, we provide its primary label name and description to a human or LLM rater, and ask them to decide if the label is relevant to the given use-case instruction or not. Note that we instruct the rater to use the presented instance as the context, and rate the relevance conditioned on the label’s ability to accurately describe some aspect of the text input. The goal of this metric is not to evaluate the label accuracy, but rather to rule out the randomness introduced by taxonomies that are seemingly relevant to the use-case instruction, but irrelevant to the corpus sample – and therefore useless for downstream applications.
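
A short sketch of how these binary judgments could be collected and aggregated is given below; the prompt wording is a hypothetical illustration, not one of the prompt templates in Appendix D.

```python
def relevance_prompt(instruction: str, label_name: str, label_description: str, text: str) -> str:
    """Build an illustrative binary-relevance question for a human or LLM rater."""
    return (
        f"Use case: {instruction}\n"
        f"Label: {label_name} - {label_description}\n"
        f"Context (text instance): {text}\n"
        "Given this context, is the label relevant to the use case? Answer Yes or No."
    )

def relevance_rate(ratings: list[str]) -> float:
    """Fraction of instances whose primary label was judged relevant to the instruction."""
    return sum(r.strip().lower() == "yes" for r in ratings) / len(ratings)

print(relevance_rate(["Yes", "No", "Yes", "Yes"]))  # 0.75
```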


[1] Raters are also offered a “None” option in addition to the pair, but are instructed to use it sparingly.

Authors:

(1) Mengting Wan, Microsoft Corporation;

(2) Tara Safavi (Corresponding authors), Microsoft Corporation;

(3) Sujay Kumar Jauhar, Microsoft Corporation;

(4) Yujin Kim, Microsoft Corporation;

(5) Scott Counts, Microsoft Corporation;

(6) Jennifer Neville, Microsoft Corporation;

(7) Siddharth Suri, Microsoft Corporation;

(8) Chirag Shah, University of Washington (work done while at Microsoft);

(9) Ryen W. White, Microsoft Corporation;

(10) Longqi Yang, Microsoft Corporation;

(11) Reid Andersen, Microsoft Corporation;

(12) Georg Buscher, Microsoft Corporation;

(13) Dhruv Joshi, Microsoft Corporation;

(14) Nagu Rangan, Microsoft Corporation.
