Table of Links
Abstract and 1 Introduction
2 Related Work
3 Method and 3.1 Phase 1: Taxonomy Generation
3.2 Phase 2: LLM-Augmented Text Classification
4 Evaluation Suite and 4.1 Phase 1 Evaluation Strategies
4.2 Phase 2 Evaluation Strategies
5 Experiments and 5.1 Data
5.2 Taxonomy Generation
5.3 LLM-Augmented Text Classification
5.4 Summary of Findings and Suggestions
6 Discussion and Future Work, and References
A. Taxonomies
B. Additional Results
C. Implementation Details
D. Prompt Templates
C IMPLEMENTATION DETAILS
C.1 Pipeline Design and Detailed Techniques
We discuss the details of our LLM-based framework in this section. The rationale behind these design choices is to ensure that the proposed framework is executable, robust, and can be validated via quantitative metrics.
Executability and Robustness. A key challenge is how to reliably execute the framework, especially when a prompt chain is involved, where each state depends on the previous outputs. To address this, we explicitly state the output format in our prompts, using predefined XML tags and instructions such as “output the taxonomy in markdown table format”. This allows us to parse the outcomes from each step of the prompt chain and feed them to the next step. Moreover, we instruct the LLMs to format the taxonomy as a markdown table with a predefined schema, which includes the name, description, and index of each label. By asking the LLMs to output the name and the index of the assigned label together, we improve the consistency of the label assignment outputs and reduce the potential post-processing effort.
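For concreteness, a minimal sketch of how such a markdown-table taxonomy output could be parsed into the predefined schema is shown below; the column names (index, name, description) follow the schema described above, while the helper itself is our own illustrative assumption rather than the paper's implementation.

```python
def parse_taxonomy_table(llm_output: str) -> list[dict]:
    """Parse a markdown table with columns: index | name | description.

    Hypothetical helper illustrating the predefined-schema idea; the exact
    column names and output format in the paper's prompts may differ.
    """
    labels = []
    for line in llm_output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # keep only markdown table rows
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the header and separator rows; keep rows whose first cell is an index.
        if len(cells) != 3 or not cells[0].isdigit():
            continue
        labels.append({"index": int(cells[0]), "name": cells[1], "description": cells[2]})
    return labels
```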
However, we acknowledge that LLMs may not always follow the format instructions perfectly. Therefore, we adopt the following strategy to increase the robustness of the pipeline execution. Specifically, we design a few guardrail tests for each type of LLM prompt. These tests include: 1) checking whether the output from a prompt adheres to the specified format and can be successfully parsed; 2) verifying whether the output is in the language (English) specified in the prompt, which is especially relevant for the summarization prompt; 3) checking whether the output satisfies a key verifiable requirement given in the prompt instruction, such as the maximum number of labels in the output taxonomy. These tests not only measure the instruction-following ability of an LLM system, but also provide a quality-assurance suite that enhances the executability of the framework.
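As an illustration, such guardrail tests could be sketched as follows; this is our own minimal rendering, where `parse_taxonomy_table` is the hypothetical parser above and the ASCII-ratio heuristic merely stands in for a proper language-identification step.

```python
def run_guardrails(llm_output: str, max_labels: int) -> bool:
    """Illustrative guardrail tests; thresholds and helpers are our own assumptions."""
    # 1) Format check: the output must parse into at least one label row
    #    under the predefined markdown-table schema.
    labels = parse_taxonomy_table(llm_output)
    if not labels:
        return False
    # 2) Language check: outputs are expected to be in English. A crude
    #    ASCII-ratio heuristic is used here; a language-ID model could be substituted.
    text = " ".join(l["name"] + " " + l["description"] for l in labels)
    if sum(ch.isascii() for ch in text) / max(len(text), 1) < 0.9:
        return False
    # 3) Key verifiable requirement: respect the maximum taxonomy size.
    return len(labels) <= max_labels
```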
We also specify a maximum number of retries (5 in our experiments) and a base temperature for each LLM call. If the outcome of an LLM call does not pass the guardrail tests, we increase the temperature by 0.1 and allow it to try again until the retry limit is reached. Although there are still cases where LLMs fail to follow the instructions after exhausting the retry quota, we find empirically that this strategy largely increases the executability of our LLM-based framework.
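A minimal sketch of this retry-with-temperature-escalation strategy, assuming a generic `call_llm` client and the guardrail function above:

```python
def call_with_retries(prompt, base_temperature, max_labels, max_retries=5):
    """Retry an LLM call, raising the temperature by 0.1 whenever the guardrails fail."""
    temperature = base_temperature
    for _ in range(max_retries):
        output = call_llm(prompt, temperature=temperature)  # hypothetical LLM client
        if run_guardrails(output, max_labels=max_labels):
            return output
        temperature += 0.1  # encourage a different sample on the next attempt
    return None  # retry quota exhausted; the caller decides how to handle this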
“Model” Selection. We draw inspiration from the practice of applying stochastic gradient descent in classic machine learning optimization. Our taxonomy generation and update pipeline does not guarantee convergence to a global optimum, but we can leverage an external validation step to perform ‘model’ selection in a more principled way. To this end, we devise an evaluation prompt that takes as input two or more candidate taxonomies, a batch of text summaries, and a use case instruction along with the taxonomy requirements, and then outputs the index of the taxonomy that best fits the data and complies with the requirements.[4] We apply the evaluation prompt on a validation set that is separate from the training set used by the update prompts. After each update step, or every few update steps, we compare the updated taxonomy against the best taxonomy tracked so far on the validation set using the evaluation prompt. Once the update prompt chain is completed, the best taxonomy is passed to the final review step. This process mirrors conventional stochastic optimization practice, where ‘early stopping’ criteria can also be applied.
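Roughly, the selection step can be pictured as below; `evaluate_taxonomies` stands in for the evaluation prompt, and the option shuffling plus repeated runs reflect the position-bias mitigation described in footnote [4]. This is a simplified sketch, not the paper's code.

```python
import random

def select_best_taxonomy(candidates, validation_summaries, use_case_instruction,
                         n_repeats=3):
    """Pick the taxonomy preferred by the evaluation prompt on the validation set."""
    votes = [0] * len(candidates)
    for _ in range(n_repeats):
        order = list(range(len(candidates)))
        random.shuffle(order)  # randomize the position of each option
        shuffled = [candidates[i] for i in order]
        winner_pos = evaluate_taxonomies(shuffled, validation_summaries,
                                         use_case_instruction)  # hypothetical LLM call
        votes[order[winner_pos]] += 1  # map back to the original candidate index
    return candidates[votes.index(max(votes))]
```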
Efficiency Analysis and Sample Size Suggestion. The efficiency of our pipeline depends on the choice of the corpus sample size and the LLM system for each phase. For the taxonomy generation phase (Phase 1), we suggest using a ‘small-to-medium’ size corpus sample that is representative of the whole corpus. The sample size (N) should be large enough to capture the diversity of the corpus, but not so large as to incur unnecessary computational cost. In our experiments, we found that a sample size of around 10k was sufficient to produce a high-quality taxonomy with no more than 100 labels. The most computationally intensive stage of this phase is the summarization stage, which requires calling an LLM at least N times to generate summaries for the entire corpus sample. This stage can be skipped if the input texts are short and normative, or replaced by a cheaper or more specialized summarization model. The generation and update prompt chain requires an LLM system with high reasoning capacity and a large context window. We used GPT-4 (with a 32k context window) and GPT-3.5-Turbo (with a 16k context window) in our experiments, and were able to achieve reasonable efficiency with proper batching (with a batch size of 200). We observed that GPT-3.5-Turbo was 5x-10x faster than GPT-4, but it may compromise the quality of the final label taxonomy.
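As a simple illustration of the batching described above, the update prompt chain can be driven over the sampled summaries as follows, where `update_taxonomy_prompt` is a hypothetical stand-in for the paper's update prompt call:

```python
def run_update_chain(initial_taxonomy, summaries, batch_size=200):
    """Feed the corpus sample through the update prompt in fixed-size batches."""
    taxonomy = initial_taxonomy
    for start in range(0, len(summaries), batch_size):
        batch = summaries[start:start + batch_size]
        taxonomy = update_taxonomy_prompt(taxonomy, batch)  # hypothetical LLM call
    return taxonomy
```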
For the label assignment and classifier development phase (Phase 2), we recommend using a ‘medium-to-large’ size corpus sample that covers the range of labels in the taxonomy. The sample size needed also depends on the difficulty of the classification task and the effectiveness of the representation model used. Since this phase involves applying an LLM on the entire sample, we suggest starting with a ‘medium’ size sample for model development, and increasing it as needed.
C.2 Experiment Details
LLM Configurations. We used the following fixed parameter configurations for all prompts applied in this work: frequency_penalty=0, presence_penalty=0, top_p=0.5. We purposely apply a higher temperature for the taxonomy generation prompt chain to elicit the generative power of LLMs: the base temperature is set to 0.5 for the “generation” prompt and 0.2 for the “update” prompt, while it is set to 0 for all other prompts in our experiments.
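For concreteness, these settings can be collected into per-prompt-type parameter dictionaries along the following lines (a sketch using OpenAI-style parameter names; the prompt-type keys are our own shorthand):

```python
# Shared decoding parameters applied to every prompt in the experiments.
COMMON_PARAMS = {"frequency_penalty": 0, "presence_penalty": 0, "top_p": 0.5}

# Base temperatures per prompt type; generation/update are intentionally higher.
BASE_TEMPERATURE = {
    "generation": 0.5,  # taxonomy generation prompt
    "update": 0.2,      # taxonomy update prompt
    "default": 0.0,     # summarization, review, label assignment, evaluation, ...
}

def prompt_params(prompt_type: str) -> dict:
    """Assemble the parameter dictionary for a given prompt type."""
    temperature = BASE_TEMPERATURE.get(prompt_type, BASE_TEMPERATURE["default"])
    return {**COMMON_PARAMS, "temperature": temperature}
```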
Hyperparameter Selection. For the classifiers presented in Section 5.3, we perform a grid search based on accuracy on the validation set, as follows (a code sketch is given after this list).
• Logistic Regression: An ℓ2 regularizer is applied, and λ is selected from [0.01, 0.1, 1, 10].
• LightGBM: We use the default number of leaves (31) in the official LightGBM package, and the maximum depth is selected from [3, 5, 7, 9].
• MLP: We apply an Adam [14] optimizer with weight decay set to 1e−5 and a learning rate of 0.001.
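A minimal grid-search sketch consistent with the settings above is given here, assuming scikit-learn and LightGBM, with X_train/X_val standing for the text representations and y_train/y_val for the labels; note that scikit-learn parameterizes the ℓ2 strength as C = 1/λ. The MLP, whose hyperparameters are fixed, is omitted.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier

def grid_search(X_train, y_train, X_val, y_val):
    """Select hyperparameters by validation accuracy, mirroring the grids above."""
    best = {}

    # Logistic regression: l2 penalty with lambda in [0.01, 0.1, 1, 10] (C = 1/lambda).
    best_acc = -1.0
    for lam in [0.01, 0.1, 1, 10]:
        clf = LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=1000)
        clf.fit(X_train, y_train)
        acc = accuracy_score(y_val, clf.predict(X_val))
        if acc > best_acc:
            best_acc, best["logistic_regression"] = acc, clf

    # LightGBM: default 31 leaves, max_depth selected from [3, 5, 7, 9].
    best_acc = -1.0
    for depth in [3, 5, 7, 9]:
        clf = LGBMClassifier(num_leaves=31, max_depth=depth)
        clf.fit(X_train, y_train)
        acc = accuracy_score(y_val, clf.predict(X_val))
        if acc > best_acc:
            best_acc, best["lightgbm"] = acc, clf

    return best
```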
Instruction Following Results. In addition to the results reported in Sections 5.2 and 5.3, we also evaluate the instruction-following ability of the two LLM systems applied in our experiments. For the first summarization stage of our proposed taxonomy generation pipeline (Stage 1 in Section 3.1), we primarily evaluate 1) whether the output can be successfully parsed based on the predefined format in the prompt (i.e., format check) and 2) whether the output complies with the language specified in the prompt (i.e., English). We found that GPT-4 performed flawlessly, passing 100% of the format and language checks. GPT-3.5-Turbo, on the other hand, had a very low failure rate for the format check (<0.01%) and a slightly higher failure rate for the language check (around 2%). However, we also notice that 0.3% of GPT-3.5-Turbo outputs passed the strict format check but copied the instruction into the XML tags. Given that the overall instruction-following success rate is high and our taxonomy generation pipeline is relatively robust to minor perturbations of the input batch, we discard the conversations that did not pass the instruction-following tests for GPT-3.5-Turbo in the subsequent stage.
For the taxonomy generation and update stage (Stage 2 in Section 3.1), we evaluate whether the prompt chain can successfully complete each of 10 epoch runs, which requires that all the intermediate taxonomy outcomes 1) can be successfully parsed (i.e., format check) and 2) comply with the predefined taxonomy size limit (i.e., maximum number of generated labels). GPT-4 again performed flawlessly, completing 10 out of 10 epochs for both taxonomies. GPT-3.5-Turbo, however, struggled on this stage, primarily because it persistently exceeded the taxonomy size limit in the ‘Update’ step. In the end, it completed only 4 out of 10 epochs for the intent taxonomy and 1 out of 10 epochs for the domain taxonomy. For the native label assignment stage, we find that both GPT-4 and GPT-3.5-Turbo pass the format check close to 100% of the time.
[4] Note that, to mitigate the potential position bias [16] in such single-choice or pairwise selection evaluations, we always randomize the position of each option and run the evaluation multiple times in all of our experiments.
Authors:
(1) Mengting Wan, Microsoft Corporation;
(2) Tara Safavi (Corresponding authors), Microsoft Corporation;
(3) Sujay Kumar Jauhar, Microsoft Corporation;
(4) Yujin Kim, Microsoft Corporation;
(5) Scott Counts, Microsoft Corporation;
(6) Jennifer Neville, Microsoft Corporation;
(7) Siddharth Suri, Microsoft Corporation;
(8) Chirag Shah, University of Washington (work done while at Microsoft);
(9) Ryen W. White, Microsoft Corporation;
(10) Longqi Yang, Microsoft Corporation;
(11) Reid Andersen, Microsoft Corporation;
(12) Georg Buscher, Microsoft Corporation;
(13) Dhruv Joshi, Microsoft Corporation;
(14) Nagu Rangan, Microsoft Corporation.