Table of Links
- Abstract and Introduction
- SylloBio-NLI
- Empirical Evaluation
- Related Work
- Conclusions
- Limitations and References
A. Formalization of the SylloBio-NLI Resource Generation Process
B. Formalization of Tasks 1 and 2
C. Dictionary of gene and pathway membership
D. Domain-specific pipeline for creating NL instances and E. Accessing LLMs
F. Experimental Details
G. Evaluation Metrics
H. Prompting LLMs – Zero-shot prompts
I. Prompting LLMs – Few-shot prompts
J. Results: Misaligned Instruction-Response
K. Results: Ambiguous Impact of Distractors on Reasoning
L. Results: Models Prioritize Contextual Knowledge Over Background Knowledge
M. Supplementary Figures and N. Supplementary Tables
K Results: Ambiguous Impact of Distractors on Reasoning
LLMs show only slight sensitivity to an increasing number of distractors nD in the prompt, with overall accuracy remaining stable (Figs. 12-15, Table 4).
While some models struggle as nD increases, others can leverage few-shot learning to mitigate the impact of distractors, though the effect is scheme-dependent. Considering Task 1 accuracy in the ZS setting (Fig. 12, Table 6), Gemma-7b shows a significant decline as nD increases, particularly in the generalized dilemma (r = −0.643, p = 0.001), generalized modus ponens (r = −0.592, p = 0.002), and generalized modus tollens (r = −0.571, p = 0.004) schemes, indicating a moderate negative correlation. In contrast, in the same ZS setting, Mistral-7B-Instruct-v0.2 exhibits a moderate improvement in accuracy with higher nD in the generalized modus tollens scheme (r = 0.540, p = 0.006), and a weak positive correlation overall (r = 0.333, p < 0.001).
Considering reasoning accuracy in the ZS setting (Fig. 14, Table 8), Gemma-7b exhibits a substantial drop as nD increases (r = −0.951, p < 0.001), starting from an already low accuracy of 0.3 at nD = 0. The steepest declines are observed in the hypothetical syllogism 1 (r = −1.0, p < 0.001) and generalized dilemma (r = −1.0, p < 0.001) schemes. For Gemma-7b-it, the strongest negative correlations between reasoning accuracy and nD occur in the hypothetical syllogism 1, generalized modus ponens, and generalized contraposition schemes (r = −1.0, p < 0.001).
In the FS setting (Fig. 15), Gemma-7b-it consistently exhibits significant decreases in reasoning accuracy across all schemes, with the most pronounced effect in hypothetical syllogism 3 and generalized modus tollens (r = −1.0, p < 0.001). Interestingly, Mistral-7B-Instruct shows, in both settings, either a positive or a negative significant correlation depending on the scheme.
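The r values above relate accuracy to the number of distractors. As a minimal sketch of how such a coefficient could be computed, assuming a Pearson correlation (the text does not name the coefficient used), the helper `pearson_r` and the accuracy values below are hypothetical and purely illustrative. They are not taken from the paper's results.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical example: accuracy measured at each distractor count nD.
# These numbers are made up for illustration only.
n_distractors = [0, 1, 2, 3, 4, 5]
accuracy = [0.30, 0.26, 0.22, 0.18, 0.14, 0.10]

r = pearson_r(n_distractors, accuracy)
print(round(r, 3))
```

A strictly linear decline such as this one yields r close to −1.0, which is why a value of −1.0 describes a consistent drop with each added distractor but says nothing about how large that drop is.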
The findings underscore the substantial impact of distractors on reasoning accuracy, particularly in complex syllogistic reasoning tasks, revealing that current LLMs are highly susceptible to performance degradation as distractor complexity increases.
:::info
Authors:
(1) Magdalena Wysocka, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(2) Danilo S. Carvalho, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom and Department of Computer Science, Univ. of Manchester, United Kingdom;
(3) Oskar Wysocki, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(4) Marco Valentino, Idiap Research Institute, Switzerland;
(5) André Freitas, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom, Department of Computer Science, Univ. of Manchester, United Kingdom and Idiap Research Institute, Switzerland.
:::
:::info
This paper is available on arXiv under a CC BY-NC-SA 4.0 license.
:::