Table of Links
- Abstract and Introduction
- Precision tuning for protein modeling
- QA Task Performance
- Results and References
Precision tuning for protein modeling
This study uses two datasets: (1) a biomedical information retrieval dataset of 480,000 research papers on protein structure/function prediction and (2) a generic natural language understanding dataset. The biomedical dataset was used to fine-tune all models in this study, while the generic NLP dataset was used to test model performance. Research papers were collected from online sources such as PubMed and Scopus and were preprocessed and cleaned to remove duplicates and irrelevant documents.
The collection process for the biomedical information retrieval dataset began with a comprehensive search of research papers related to protein structure/function prediction. The search was conducted across databases and repositories such as PubMed and Scopus, using keywords such as “protein structure,” “protein function,” and “protein prediction.” The resulting papers were then curated and selected based on their relevance and quality: only papers published in reputable journals and containing relevant information on protein structure/function prediction were included. A total of 480,000 papers made up the final dataset, which was then used to fine-tune all models in this study.
Questions were formed on the basis of the specific use case in biomedical information retrieval, with the aim of testing the models’ performance on highly specific and targeted questions. Examples of such questions include “What is the function of protein X in the human body?” and “What is the structure of protein Y at atomic resolution?” These questions require a deep understanding of the subject matter and a high level of specificity in the retrieved information.
[Figure 1.1: General question-answer training set examples.]
[Figure 1.2: Hyper-specific question-answer training set examples.]
2.1 Fine-Tuning With PyTorch
To fine-tune the models, we used a variety of questions related to protein structure and function prediction. These questions ranged from highly specific queries, such as “What is the 3D structure of protein X?” to broader prompts, such as “What are the latest developments in protein structure prediction techniques?” as shown in Figures 1.1 and 1.2. The answers to these questions were gathered from the research papers in the biomedical dataset, with a focus on retrieving the most relevant and accurate information.
The AdamW optimizer is a variant of the Adam optimizer that includes weight decay regularization. It is defined by the following update rule:
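For reference, the standard AdamW update (Loshchilov & Hutter) can be written as:

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)
```

Here g_t is the gradient at step t, η the learning rate, and λ the weight-decay coefficient. The decay term λθ is applied directly to the weights rather than folded into the gradient, which is what distinguishes AdamW from Adam with L2 regularization.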
All models were fine-tuned for a total of 10 epochs with a batch size of 32 using the PyTorch transformer framework. Training and testing were performed on a server with 4 NVIDIA Tesla V100 GPUs and 64 GB of RAM. The models were trained on the biomedical dataset in a supervised manner, with the questions and corresponding answers used as inputs and labels, respectively.
The fine-tuning process was carried out using a variety of techniques, including layer-wise learning rate decay, grouped layer-wise learning rate decay, and surgical fine-tuning. These techniques were chosen for their effectiveness in adapting the models to the specific tasks of biomedical information retrieval. In addition, we used techniques such as data augmentation and mix-up training to improve the generalization capabilities of the models.
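Of these, mix-up training can be sketched compactly: each training example is replaced by a convex combination of two examples and of their labels. A minimal illustration on dense feature vectors with one-hot labels (the function name and the alpha value are illustrative, not the paper's exact pipeline):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two examples: draw lam ~ Beta(alpha, alpha), then form
    lam * example1 + (1 - lam) * example2 for both features and labels."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Two toy examples with one-hot labels; the mixed label is soft.
x, y, lam = mixup([1.0, 0.0], [1, 0], [0.0, 1.0], [0, 1])
# x == [lam, 1 - lam] and y == [lam, 1 - lam] for the sampled lam
```

The same interpolation is typically applied batch-wise on input embeddings during training, encouraging linear behavior between examples and improving generalization.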
2.1.1 Layer-Wise Learning Rate Decay
Layer-wise learning rate decay (LLRD) is a method of fine-tuning a neural network, specifically a Transformer model, in which different learning rates are applied to different layers in the model. This approach is based on the idea that different layers in a Transformer model often capture different types of information. For example, bottom layers may encode more general and broad-based information, while top layers closer to the output may encode more specific and task-specific information.
To implement LLRD, a learning rate is chosen for the top layer of the model and then a multiplicative decay rate is used to decrease the learning rate layer-by-layer from top to bottom. Alternatively, the layers can also be grouped and different learning rates applied to each group.
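The multiplicative scheme can be made concrete with a short sketch; the hyperparameters below (a 12-layer model, a top-layer rate of 2e-5, a decay factor of 0.9) are illustrative, not the paper's settings. The resulting list can be handed to an optimizer as per-layer parameter groups:

```python
def llrd_learning_rates(top_lr, decay, num_layers):
    """Multiplicative layer-wise learning rate decay: the top layer gets
    top_lr, and each layer below it is scaled by a further factor of `decay`."""
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = llrd_learning_rates(top_lr=2e-5, decay=0.9, num_layers=12)
# lrs[11] (top layer) == 2e-5; lrs[0] (bottom layer) == 2e-5 * 0.9**11 ≈ 6.3e-6,
# so the bottom layer is updated roughly 3x more slowly than the top layer.
```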
LLRD has been shown to be effective in improving the performance of Transformer models in various tasks, including natural language processing and information retrieval. It has been found to be particularly useful in situations where the target dataset is small or the distribution shift from the pre-trained model is significant.
To quantify the effectiveness of LLRD, common metrics such as accuracy, precision, and recall can be used. In the context of information retrieval, relevance of the retrieved information can also be evaluated using measures such as mean average precision (MAP) or normalized discounted cumulative gain (NDCG). These metrics can be calculated using the following equations:
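As a concrete reference, MAP and NDCG can be computed as follows; the relevance judgments in the example are illustrative, not drawn from the paper's data:

```python
import math

def average_precision(relevant):
    """AP for one query: `relevant` is a ranked list of 0/1 judgments.
    AP = mean of precision@k over the ranks k where the item is relevant."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(queries):
    """MAP = mean of AP over all queries."""
    return sum(average_precision(q) for q in queries) / len(queries)

def ndcg(gains):
    """NDCG for one ranked list of graded relevance scores:
    DCG = sum(rel_i / log2(i + 1)), normalized by the ideal (sorted) DCG."""
    dcg = sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))
    ideal = sum(g / math.log2(i + 1)
                for i, g in enumerate(sorted(gains, reverse=True), start=1))
    return dcg / ideal if ideal > 0 else 0.0

print(mean_average_precision([[1, 0, 1], [0, 1]]))  # AP values 5/6 and 1/2 -> MAP = 2/3
print(ndcg([3, 2, 3, 0, 1, 2]))
```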
2.1.2 Surgical Fine-Tuning
Surgical fine-tuning is a method that involves selectively fine-tuning a subset of layers in a pre-trained model. It is a form of transfer learning that aims to preserve learned features while adapting the model to the new task at hand. In the context of biomedical information retrieval, surgical fine-tuning allows us to tailor the model to the specific needs of the task, maximizing its performance and interpretability.
For this study, we used surgical fine-tuning to optimize the performance of DeepMind’s RETRO (7B parameters), GPT-J (6B parameters), GPT-3 (175B parameters), and BLOOM (176B parameters) on the protein structure/function prediction dataset. We again used the AdamW optimizer with a linear learning rate scheduler, which decreases the learning rate linearly from its initial value to zero over the training steps.
To determine the optimal subset of layers to fine-tune for each model, we conducted a series of experiments. We first fine-tuned all layers of each model and measured the performance on a set of evaluation tasks. We then fine-tuned only the top few layers and compared the results to the full fine-tuning approach. Through this process, we were able to identify the optimal subset of layers to fine-tune for each model, maximizing its performance on the protein structure/function Q&A task.
To perform surgical fine-tuning on a model, we first define the layers that we want to fine-tune. We separated each model into 5 layer groups and fine-tuned only the second and third groups (indices 1 and 2). We can represent this as a binary mask, where 1 indicates that a layer should be fine-tuned and 0 indicates that it should be frozen. In this case, the binary mask is [0, 1, 1, 0, 0].
Next, we need to calculate the optimal learning rate for each layer. To do this, we can use the following equation:

learning_rate_i = base_learning_rate * sqrt(data_size) / sqrt(parameters_i)

where i is the layer index, base_learning_rate is the base learning rate for all layers, data_size is the size of the dataset being used for fine-tuning, and parameters_i is the number of parameters in layer i.
Finally, we use these calculated learning rates to fine-tune the model. During training, we simply multiply the learning rate for each layer by the binary mask to determine the actual learning rate for that layer. With a base learning rate of 0.001, a dataset of 1000 examples, and per-layer parameter counts of 100, 50, 75, 100, and 125, the learning rates for each layer are (rounded):
layer 0: 0.001 * sqrt(1000) / sqrt(100) ≈ 0.00316
layer 1: 0.001 * sqrt(1000) / sqrt(50) ≈ 0.00447
layer 2: 0.001 * sqrt(1000) / sqrt(75) ≈ 0.00365
layer 3: 0.001 * sqrt(1000) / sqrt(100) ≈ 0.00316
layer 4: 0.001 * sqrt(1000) / sqrt(125) ≈ 0.00283
Applying the binary mask [0, 1, 1, 0, 0], the actual learning rates for each layer during training were:
layer 0: 0.00316 * 0 = 0
layer 1: 0.00447 * 1 ≈ 0.00447
layer 2: 0.00365 * 1 ≈ 0.00365
layer 3: 0.00316 * 0 = 0
layer 4: 0.00283 * 0 = 0
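This per-layer calculation can be reproduced with a few lines of Python, applying the stated formula lr_i = base_lr * sqrt(data_size) / sqrt(params_i) and then the mask (the helper name is illustrative):

```python
import math

def surgical_lrs(base_lr, data_size, layer_params, mask):
    """Per-layer rate lr_i = base_lr * sqrt(data_size) / sqrt(params_i),
    multiplied by the binary mask so frozen layers get a rate of 0."""
    lrs = [base_lr * math.sqrt(data_size) / math.sqrt(p) for p in layer_params]
    return [lr * m for lr, m in zip(lrs, mask)]

# Parameter counts and mask from the worked example above.
masked = surgical_lrs(0.001, 1000, [100, 50, 75, 100, 125], [0, 1, 1, 0, 0])
# Only layers 1 and 2 keep a non-zero learning rate.
```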
Authors:
(1) Pranjali Awasthi;
(2) David Recio-Mitter;
(3) Yosuke Kyle Sugi.