Benchmarking AnLLMs: Insights From OpenBookQA To BoolQ

Benchmarking AnLLMs: Insights from OpenBookQA to BoolQ | HackerNoon

Last updated: 2024/10/11 at 1:18 AM

News Room Published 11 October 2024

Authors:

(1) Jianhui Pang, University of Macau (work done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab);

(2) Fanghua Ye, University College London (work done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab);

(3) Derek F. Wong, University of Macau;

(4) Longyue Wang, Tencent AI Lab, and corresponding author.

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Anchor-based Large Language Models

3.1 Background

3.2 Anchor-based Self-Attention Networks

3.3 Anchor-based Inference

4 Experiments and 4.1 Our Implementation

4.2 Data and Training Procedure

4.3 Evaluation

5 Results

6 Analysis

7 Conclusion, Limitations, Ethics Statement, and References

A More Experimental Results

B Data Settings

4.3 Evaluation

We evaluate on a diverse collection of benchmarks with varying text lengths: OpenBookQA (OBQA) (Mihaylov et al., 2018), WinoGrande (WG) (Sakaguchi et al., 2021), ARC-easy (ARC-e) and ARC-challenge (ARC-c) (Clark et al., 2018), PIQA (Bisk et al., 2020), HellaSwag (HS) (Zellers et al., 2019), SciQ (Welbl et al., 2017), and BoolQ (Clark et al., 2019). Together, these benchmarks test reasoning, comprehension, understanding of the physical world, and prediction of future events. They also span a range of input lengths, from the shorter contexts of OBQA to the longer texts of BoolQ, which lets us assess model performance across diverse tasks and text complexities. To measure both the accuracy and the efficiency of our models, we report three metrics in both zero-shot and five-shot settings. For AnLLM-AC in the five-shot setting, we append the anchor token to the end of each demonstration.
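The five-shot setup for the anchor-token variant can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: the anchor string `<AC>`, the function name `build_few_shot_prompt`, the separator between demonstrations, and the sample questions are all assumptions made for the example.

```python
# Hypothetical anchor string; the actual token depends on the model's vocabulary.
ANCHOR_TOKEN = "<AC>"


def build_few_shot_prompt(demonstrations, query, anchor_token=ANCHOR_TOKEN):
    """Append the anchor token to the end of each demonstration, then add the query."""
    anchored = [demo.rstrip() + " " + anchor_token for demo in demonstrations]
    # Assumed layout: demonstrations and query separated by blank lines.
    return "\n\n".join(anchored + [query])


# Made-up demonstrations, for illustration only.
demos = [
    "Question: What do plants need to make food? Answer: sunlight",
    "Question: What is frozen water called? Answer: ice",
]
prompt = build_few_shot_prompt(demos, "Question: Which organ pumps blood? Answer:")
print(prompt)
```

Only the demonstrations receive an anchor here; the query is left open so the model can complete it.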

• Accuracy (Acc). This conventional metric measures the prediction accuracy of the models. Following previous studies (Gao et al., 2023), we take the option with the highest probability as the prediction and compute accuracy against the gold-standard labels.

• Keys/Values Caches Reduction (C⇓). In the five-shot evaluation, the demonstrations can be cached in GPU memory for subsequent reuse. However, longer demonstrations consume more memory. This metric assesses the memory efficiency of the AnSAN technique.

• Inference Acceleration Ratio (T⇑). Similar to Wang et al. (2023), we capitalize on the cached keys/values and report the inference acceleration ratio, which indicates the inference efficiency of the AnSAN technique.
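The accuracy metric above can be sketched as follows, assuming each answer option has already been scored with a log-probability by the model. The function names and the scores are hypothetical and only illustrate the "highest-probability option wins" rule.

```python
def predict(option_logprobs):
    """Return the index of the option with the highest log-probability."""
    return max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])


def accuracy(all_option_logprobs, gold_labels):
    """Fraction of examples whose top-scoring option matches the gold label."""
    if len(all_option_logprobs) != len(gold_labels):
        raise ValueError("need exactly one gold label per example")
    if not gold_labels:
        raise ValueError("need at least one example")
    correct = sum(
        predict(scores) == gold
        for scores, gold in zip(all_option_logprobs, gold_labels)
    )
    return correct / len(gold_labels)


# Made-up log-probabilities for three multiple-choice questions.
scores = [
    [-2.3, -0.4, -3.1, -1.9],
    [-1.2, -1.5, -0.9, -2.0],
    [-0.7, -0.2, -1.1, -3.0],
]
gold = [1, 2, 0]
acc = accuracy(scores, gold)
print(f"accuracy = {acc:.3f}")
```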

Note that we first report full attention inference results for all models, then present results with the AnSAN method (+AnSAN) applied, compressing sequence information into anchor tokens.
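The two efficiency metrics compare a full-attention run with a +AnSAN run. This section does not give their formulas, so the sketch below assumes one plausible reading: C⇓ as the percentage of cached key/value positions removed, and T⇑ as full-attention time divided by +AnSAN time. The function names and all numbers are illustrative assumptions, not reported results.

```python
def cache_reduction_pct(full_cache_len, reduced_cache_len):
    """Assumed definition: percentage of cached key/value positions removed."""
    if full_cache_len <= 0:
        raise ValueError("full_cache_len must be positive")
    return 100.0 * (full_cache_len - reduced_cache_len) / full_cache_len


def acceleration_ratio(full_seconds, ansan_seconds):
    """Assumed definition: full-attention time divided by +AnSAN time."""
    if ansan_seconds <= 0:
        raise ValueError("ansan_seconds must be positive")
    return full_seconds / ansan_seconds


# Illustrative numbers: five demonstrations of 100 tokens each are cached in full,
# versus keeping only one anchor position per demonstration.
full_len = 5 * 100
anchor_len = 5
reduction = cache_reduction_pct(full_len, anchor_len)
speedup = acceleration_ratio(full_seconds=2.0, ansan_seconds=0.5)
print(f"cache reduction = {reduction:.1f}%, speedup = {speedup:.1f}x")
```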
