Authors:
(1) Pranjali Awasthi;
(2) David Recio-Mitter;
(3) Yosuke Kyle Sugi.
Table of Links
- Abstract and Introduction
- Precision tuning for protein modeling
- QA Task Performance
- Results and References
ABSTRACT
Language models have become increasingly popular in recent years for tasks like information retrieval. As use cases become oriented toward specific domains, fine-tuning has become the default route to adequate performance. Fine-tuning these models for specific tasks and datasets requires careful tuning of the model's hyperparameters and training techniques. In this paper, we present an in-depth analysis of the performance of four transformer-based language models on the task of biomedical information retrieval. The models we consider are DeepMind's RETRO (7B parameters), GPT-J (6B parameters), GPT-3 (175B parameters), and BLOOM (176B parameters). We compare their performance on the basis of relevance, accuracy, and interpretability, using a large corpus of 480,000 research papers on protein structure/function prediction as our dataset. Our findings suggest that smaller models (<10B parameters), when fine-tuned on domain-specific datasets, outperform larger language models on highly specific questions in terms of accuracy, relevancy, and interpretability by a significant margin (+50% on average). However, larger models do provide generally better results on broader prompts.
Introduction
Recent advancements in natural language processing have enabled us to build increasingly powerful models capable of understanding complex queries and providing accurate and relevant responses. As these models become more sophisticated, their ability to handle complex tasks is increasing. However, there is still room for improvement when it comes to maximizing use-case specificity through precision model tuning. This paper investigates how smaller models can be used to optimize performance for highly specific tasks.
Precision tuning, also known as surgical fine-tuning, refers to the process of selectively fine-tuning a subset of layers in a pre-trained model. This approach allows for the preservation of learned features while adapting to the new task, and has been shown to match or outperform other fine-tuning approaches in certain settings. In the context of biomedical information retrieval, precision tuning can be particularly beneficial in maximizing use-case specificity.
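The core mechanic of surgical fine-tuning can be illustrated with a minimal sketch in PyTorch, assuming a layer-stack stand-in for a pre-trained model (the toy `nn.Sequential` below is illustrative only, not one of the models evaluated in this paper): gradients are disabled for all parameters, then re-enabled only for the layers selected for adaptation.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained network: a small stack of layers.
# Surgical fine-tuning keeps most layers frozen and updates only a subset.
model = nn.Sequential(
    nn.Linear(16, 16),  # "lower" layer: general features, left frozen
    nn.ReLU(),
    nn.Linear(16, 16),  # "middle" layer: left frozen
    nn.ReLU(),
    nn.Linear(16, 4),   # "top" layer: selected for fine-tuning
)

# Freeze everything, then unfreeze only the final layer.
for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

# The optimizer sees only the unfrozen parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

total = sum(p.numel() for p in model.parameters())
tuned = sum(p.numel() for p in trainable)
print(f"tuning {tuned}/{total} parameters")  # tuning 68/612 parameters
```

The same freeze/unfreeze pattern applies unchanged to transformer blocks in larger models; only the choice of which layers to unfreeze differs per use case.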
One of the primary benefits of precision tuning is the ability to improve the accuracy of the model. In the field of biomedical information retrieval, accuracy is crucial as incorrect or biased information can have serious consequences. For example, a model that retrieves incorrect information about a particular protein structure or function could lead to flawed research or incorrect treatment recommendations. By fine-tuning only the most relevant layers for the specific use case, precision tuning allows for a more focused and accurate model.
Language models have become a crucial component of natural language processing (NLP) tasks, particularly in the field of information retrieval. These models, which are trained to predict the probability distribution of a sequence of words, are typically large and complex, with parameters numbering in the billions. However, recent research suggests that smaller language models, with fewer parameters, may be more effective in specific use cases, particularly for highly specific queries. Accuracy is especially important in biomedical information retrieval because it directly affects the quality of the retrieved information: in medicine, incorrect or misleading information can have serious consequences for patients, healthcare professionals, and the healthcare system as a whole. This is especially true for protein structure/function prediction, a highly complex and nuanced area of research.
Meta AI’s Galactica, a large language model (120B parameters) designed to assist scientists with relevant scientific compositions, was taken down just three days after its launch due to its inability to accurately solve basic mathematical questions and its tendency to generate gibberish. This incident highlights the importance of accuracy in biomedical information retrieval, as incorrect or biased information can have serious consequences in the field of medicine.
One way to counter the effects of entropy in large language models is through precision model tuning, which involves fine-tuning a model on a specific task or dataset in order to increase its performance. This process is especially effective when applied to smaller models, which have fewer parameters and are therefore less prone to overfitting. By focusing on a specific task or dataset, smaller models can be more accurate and relevant in their retrieval of information, leading to better outcomes for the end user.
Entropy is a measure of the amount of disorder or randomness in a system. In the context of language models, entropy refers to the amount of randomness or variability in the model’s output. In biomedical information retrieval, it is important to minimize entropy in order to ensure the accuracy and reliability of the retrieved information.
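In the next-token-prediction setting, this notion of entropy can be made concrete with Shannon entropy over the model's output distribution. The sketch below (a simplified illustration with hand-picked distributions, not data from the experiments) shows how a distribution concentrated on a few tokens, as one would hope from a domain-tuned model, has lower entropy than a diffuse one.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a probability distribution,
    e.g. a model's next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A confident, domain-tuned model concentrates mass on few tokens...
focused = [0.90, 0.05, 0.03, 0.02]
# ...while a broader model spreads probability more evenly.
diffuse = [0.25, 0.25, 0.25, 0.25]

print(round(shannon_entropy(focused), 3))  # low entropy
print(shannon_entropy(diffuse))            # 2.0 bits, the maximum for 4 outcomes
```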
Hyper-tuning smaller models is a technique that can be used to maximize the use-case specificity of a model, thereby reducing entropy and improving the accuracy of the retrieved information. This is because smaller models are able to focus on a specific use case and are less likely to produce irrelevant or incorrect information compared to larger models. In the case of Meta AI’s Galactica, the large language model was trained on a broad range of scientific data, which likely resulted in a high level of entropy in the model’s output. This may have contributed to the model’s inability to accurately generate scientific content and solve basic mathematical problems. By contrast, hyper-tuning a smaller model on a specific domain, such as protein structure prediction, may result in a model with lower entropy and higher accuracy in that specific domain.
In this paper, we compare the performance of two smaller language models, DeepMind’s RETRO model (7B parameters) and GPT-J (6B parameters), with two larger models, GPT-3 (175B parameters) and BLOOM (176B parameters), on the task of biomedical information retrieval. This task involves searching through a large corpus of research papers, in this case 480,000 papers on protein structure and function prediction, to find relevant and accurate information in response to a query. We evaluate the models based on three key metrics: relevance, accuracy, and interpretability.
• Relevance refers to how closely the retrieved information matches the query. In information retrieval, this is typically measured using precision, which is the proportion of retrieved documents that are relevant, and recall, which is the proportion of relevant documents that are retrieved. To quantify these metrics, we use standard evaluation measures such as F1 score, which is the harmonic mean of precision and recall.
• Accuracy refers to the overall correctness of the retrieved information. In the biomedical domain, this could be measured by the percentage of correctly retrieved documents that are relevant to the given query.
• Interpretability refers to the ease with which the results of the model can be understood and explained. In the context of information retrieval, this could include the ability to identify the specific factors that influenced the model’s decision-making process and to understand the relationships between the retrieved documents and the given query.
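The relevance metrics above reduce to a short computation per query. The sketch below shows precision, recall, and their harmonic mean (F1) over sets of document IDs; the document IDs and counts are hypothetical, chosen only to illustrate the formulas.

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall, and F1 for one query, given document-ID sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# out of 5 relevant documents in the corpus.
p, r, f1 = precision_recall_f1(
    {"d1", "d2", "d3", "d4"}, {"d1", "d2", "d3", "d5", "d6"}
)
print(p, r, round(f1, 3))  # 0.75 0.6 0.667
```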
Overall, the compounded benefits of precision tuning in biomedical information retrieval include improved accuracy, better interpretability, and a reduction in entropy. These factors make precision tuning an important consideration in the development and fine-tuning of language models for use in specific use cases.