In this post we’ll continue working on link prediction with the Twitch dataset. We’ll focus on training the ML model based on Graph Neural Networks (GNNs) and optimizing its hyperparameters. By now we already have the graph data processed and prepared for model training. The previous steps are described in Part 3 – Data processing, Part 2 – Exporting data from the DB and Part 1 – Loading data into the DB.
Read part 1 here; part 2 here; and part 3 here.
CHOOSING A GRAPH NEURAL NETWORK TYPE: GCNs AND R-GCNs
We’ll use Graph Convolutional Neural Networks just like we did in the local link prediction post, and although Neptune ML uses the same DGL.ai framework, the underlying model is a bit different. Neptune ML supports both homogeneous graphs (with a single node type and a single edge type) and heterogeneous graphs (with multiple node and edge types). The dataset we’re working with has a single node type (user) and a single edge type (friendship). Although a Graph Convolutional Network (GCN) or a Graph Sample and Aggregate (GraphSAGE) model would also work in this case, Neptune ML automatically chooses a Relational Graph Convolutional Network (R-GCN) model for datasets with node properties that may vary from node to node, as explained here. In general, R-GCNs require more compute to train because of the additional parameters needed to handle multiple node and edge types.
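To make the difference concrete, here is a minimal DGL sketch of a two-layer R-GCN encoder (an illustration only, not Neptune ML’s actual training code); with a single relation type, as in our dataset, it behaves much like a plain GCN:

import torch
import dgl
from dgl.nn import RelGraphConv

# A minimal two-layer R-GCN encoder. With num_rels=1 (our single
# 'friendship' relation) the per-relation weights collapse to a single
# matrix, which is why a plain GCN would also work for this dataset.
class RGCNEncoder(torch.nn.Module):
    def __init__(self, in_feats, num_hidden, num_rels=1):
        super().__init__()
        self.conv1 = RelGraphConv(in_feats, num_hidden, num_rels)
        self.conv2 = RelGraphConv(num_hidden, num_hidden, num_rels)

    def forward(self, g, feats, etypes):
        # etypes holds the relation id of every edge in g
        h = torch.relu(self.conv1(g, feats, etypes))
        return self.conv2(g, h, etypes)

# Toy usage: a 4-node graph where every edge has relation id 0
g = dgl.graph(([0, 1, 2], [1, 2, 3]))
feats = torch.randn(g.num_nodes(), 8)
etypes = torch.zeros(g.num_edges(), dtype=torch.long)
embeddings = RGCNEncoder(8, 16)(g, feats, etypes)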
MODEL TRAINING HYPERPARAMETERS
During the data processing stage (described in the previous post), Neptune ML created a file named model-hpo-configuration.json. It contains the model type (R-GCN), the task type (link prediction), the evaluation metric and frequency, and 4 lists of parameters: one with fixed parameters that are not modified during training, and 3 lists of parameters to be optimized, with ranges and default values. The tunable parameters are grouped by importance, and whether each group is tuned depends on the number of available tuning jobs: 1st-tier parameters are always tuned, 2nd-tier parameters are tuned if the number of available jobs is > 10, and 3rd-tier parameters only if it’s > 50. Our model-hpo-configuration.json file looks like this:
{
  "models": [
    {
      "model": "rgcn",
      "task_type": "link_predict",
      "eval_metric": { "metric": "mrr", "global_ranking_metrics": true, "include_retrieval_metrics": false },
      "eval_frequency": { "type": "evaluate_every_pct", "value": 0.05 },
      "1-tier-param": [
        { "param": "num-hidden", "range": [16, 128], "type": "int", "inc_strategy": "power2" },
        { "param": "num-epochs", "range": [3, 100], "inc_strategy": "linear", "inc_val": 1, "type": "int", "edge_strategy": "perM" },
        { "param": "lr", "range": [0.001, 0.01], "type": "float", "inc_strategy": "log" },
        { "param": "num-negs", "range": [4, 32], "type": "int", "inc_strategy": "power2" }
      ],
      "2-tier-param": [
        { "param": "dropout", "range": [0.0, 0.5], "inc_strategy": "linear", "type": "float", "default": 0.3 },
        { "param": "layer-norm", "type": "bool", "default": true },
        { "param": "regularization-coef", "range": [0.0001, 0.01], "type": "float", "inc_strategy": "log", "default": 0.001 }
      ],
      "3-tier-param": [
        { "param": "batch-size", "range": [128, 512], "inc_strategy": "power2", "type": "int", "default": 256 },
        { "param": "sparse-lr", "range": [0.001, 0.01], "inc_strategy": "log", "type": "float", "default": 0.001 },
        { "param": "fanout", "type": "int", "options": [[10, 30], [15, 30], [15, 30]], "default": [10, 15, 15] },
        { "param": "num-layer", "range": [1, 3], "inc_strategy": "linear", "inc_val": 1, "type": "int", "default": 2 },
        { "param": "num-bases", "range": [0, 8], "inc_strategy": "linear", "inc_val": 2, "type": "int", "default": 0 }
      ],
      "fixed-param": [
        { "param": "neg-share", "type": "bool", "default": true },
        { "param": "use-self-loop", "type": "bool", "default": true },
        { "param": "low-mem", "type": "bool", "default": true },
        { "param": "enable-early-stop", "type": "bool", "default": true },
        { "param": "window-for-early-stop", "type": "int", "default": 3 },
        { "param": "concat-node-embed", "type": "bool", "default": true },
        { "param": "per-feat-name-embed", "type": "bool", "default": true },
        { "param": "use-edge-features", "type": "bool", "default": false },
        { "param": "edge-num-hidden", "type": "int", "default": 16 },
        { "param": "weighted-link-prediction", "type": "bool", "default": false },
        { "param": "link-prediction-remove-targets", "type": "bool", "default": false },
        { "param": "l2norm", "type": "float", "default": 0 }
      ]
    }
  ]
}
The model & task type parameters were set during the data export and processing stages, and should not be changed here.
The evaluation metric was automatically chosen too. Mean reciprocal rank (MRR) is the average of the reciprocal ranks of the correct link in the predicted results, with a higher MRR indicating better performance.
Evaluation frequency is set to 5% of the training progress. For example, if we have 100 epochs, evaluation will be performed every 5 epochs.
Let’s review some of the hyperparameters that will be tuned:
lr: Learning rate is one of the most impactful hyperparameters for any model training. A lower learning rate may lead to slower convergence but potentially better performance, while a higher learning rate can speed up training but might miss out on optimal solutions.
num-hidden: The num-hidden parameter refers to the number of hidden units (neurons) in each layer of the R-GCN neural network, specifically in the hidden layers. A larger number of hidden units increases the model’s capacity to learn complex patterns and relationships from the data, which can improve prediction accuracy, but may also lead to overfitting if the model becomes too complex for the dataset.
num-epochs: This defines how long the model is trained for. More epochs allow the model to learn more from the data but may increase the risk of overfitting.
batch-size: The batch size affects memory usage and convergence stability. A smaller batch size might make the model more sensitive to the data, while a larger batch size may improve training speed.
num-negs: Negative sampling affects how the model learns to distinguish true links from false ones. A higher number of negative samples may improve the quality of the predictions but it increases computational costs.
dropout: Dropout helps to prevent overfitting by randomly skipping some neurons during training. A higher dropout rate may reduce overfitting but it could make learning harder for the model.
regularization-coef: The regularization coefficient controls the strength of the penalty that discourages large weights, helping prevent the model from overfitting.
You can change the default values, range and step size for each of these parameters. The full list of parameters can be found here.
After changing the parameters, just replace the original model-hpo-configuration.json file in S3.
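For example, here is a small boto3 sketch of that round-trip; the bucket name and object key are hypothetical placeholders for your actual processing-output location. It widens the learning-rate search range and uploads the file back:

import json
import boto3

# Placeholders: point these at your processing-output location in S3
BUCKET = "OUTPUT_BUCKET"
KEY = "model-artifacts/PROCESSING_JOB_NAME/model-hpo-configuration.json"

s3 = boto3.client("s3")

# Download and parse the HPO configuration generated by data processing
cfg = json.loads(s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read())

# Widen the learning-rate search space before training starts
for p in cfg["models"][0]["1-tier-param"]:
    if p["param"] == "lr":
        p["range"] = [0.0005, 0.01]

# Replace the original file in S3
s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps(cfg, indent=2))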
IAM ROLES FOR MODEL TRAINING AND HPO
Just like the data processing described in Part 3 of this guide, model training requires 2 IAM roles: a Neptune role that provides Neptune access to SageMaker and S3, and a SageMaker execution role that is used by SageMaker while running the model training task and allows it to access S3. These roles must have trust policies that allow the Neptune and SageMaker services to assume them:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "rds.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
After creating the roles and updating their trust policies, we add them to the Neptune cluster (Neptune -> Databases -> YOUR_NEPTUNE_CLUSTER_ID -> Connectivity & Security -> IAM Roles -> Add role).
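The same can be done programmatically. Here is a boto3 sketch, with the role name and cluster ID as placeholders (permission policies still need to be attached separately):

import json
import boto3

# The trust policy shown above
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "sts:AssumeRole",
         "Principal": {"Service": "sagemaker.amazonaws.com"}},
        {"Effect": "Allow", "Action": "sts:AssumeRole",
         "Principal": {"Service": "rds.amazonaws.com"}},
    ],
}

# Create the role with the trust policy attached
iam = boto3.client("iam")
role = iam.create_role(
    RoleName="NeptuneMLModelTrainingNeptuneRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Add the role to the Neptune cluster
neptune = boto3.client("neptune")
neptune.add_role_to_db_cluster(
    DBClusterIdentifier="YOUR_NEPTUNE_CLUSTER_ID",
    RoleArn=role["Role"]["Arn"],
)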
STARTING MODEL TRAINING AND HPO USING NEPTUNE ML API
Now we’re ready to start model training. To do that, we need to send a request to the Neptune cluster’s HTTP API from inside the VPC where the cluster is located. We’ll use curl on an EC2 instance:
curl -XPOST https://YOUR_NEPTUNE_ENDPOINT:8182/ml/modeltraining \
  -H 'Content-Type: application/json' \
  -d '{
        "dataProcessingJobId" : "ID_OF_YOUR_DATAPROCESSING_JOB",
        "trainModelS3Location" : "s3://OUTPUT_BUCKET/model-artifacts/...",
        "neptuneIamRoleArn": "arn:aws:iam::123456789012:role/NeptuneMLModelTrainingNeptuneRole",
        "sagemakerIamRoleArn": "arn:aws:iam::123456789012:role/NeptuneMLModelTrainingSagemakerRole"
      }'
Only these parameters are required:
- dataProcessingJobId – the job ID is used to look up the location of the processed data in S3
- trainModelS3Location – the output location for the artifacts (weights of the model)
- Neptune and SageMaker roles (these roles must be added to the Neptune DB cluster)
There’s also the maxHPONumberOfTrainingJobs parameter that sets the number of training jobs to run with different sets of hyperparameters. By default, it’s 2, but AWS recommends running at least 10 jobs to get an accurate model.
There are many optional parameters as well: for example, we can manually select the EC2 instance type that will be used for model training with trainingInstanceType and set its storage volume size with trainingInstanceVolumeSizeInGB. The full list of parameters can be found here.
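For instance, a request that bumps the number of HPO jobs to 10 and pins the instance type might look like this in Python (a sketch that assumes IAM database authentication is disabled on the cluster, otherwise the request must be SigV4-signed; the endpoint, bucket and role ARNs are placeholders):

import requests

payload = {
    "dataProcessingJobId": "ID_OF_YOUR_DATAPROCESSING_JOB",
    "trainModelS3Location": "s3://OUTPUT_BUCKET/model-artifacts/",
    "neptuneIamRoleArn": "arn:aws:iam::123456789012:role/NeptuneMLModelTrainingNeptuneRole",
    "sagemakerIamRoleArn": "arn:aws:iam::123456789012:role/NeptuneMLModelTrainingSagemakerRole",
    "maxHPONumberOfTrainingJobs": 10,          # run 10 HPO jobs instead of the default 2
    "trainingInstanceType": "ml.g4dn.2xlarge", # pin the training instance type
}
response = requests.post("https://YOUR_NEPTUNE_ENDPOINT:8182/ml/modeltraining",
                         json=payload)
print(response.json())  # {"id": "..."}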
The cluster responds with a JSON that contains the ID of the model training job that we just created:
{"id":"d584f5bc-d90e-4957-be01-523e07a7562e"}
We can use it to get the status of the model training job with this command (use the same neptuneIamRoleArn as in the previous request):
curl https://YOUR_NEPTUNE_CLUSTER_ENDPOINT:8182/ml/modeltraining/YOUR_JOB_ID?neptuneIamRoleArn='arn:aws:iam::123456789012:role/NeptuneMLModelTrainingNeptuneRole'
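If you prefer to wait for completion programmatically, here is a small polling sketch (same placeholder endpoint, job ID and role ARN as above, and the same assumption that IAM auth is disabled):

import time
import requests

url = ("https://YOUR_NEPTUNE_ENDPOINT:8182/ml/modeltraining/YOUR_JOB_ID"
       "?neptuneIamRoleArn=arn:aws:iam::123456789012:role/NeptuneMLModelTrainingNeptuneRole")

# Poll the status endpoint once a minute until the job reaches a final state
while True:
    status = requests.get(url).json()["status"]
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)
print(status)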
Once it responds with something like this,
{
  "processingJob": {
    "name": "PROCESSING_JOB_NAME",
    "arn": "arn:aws:sagemaker:us-east-1:123456789012:processing-job/YOUR_PROCESSING_JOB_NAME",
    "status": "Completed",
    "outputLocation": "s3://OUTPUT_BUCKET/model-artifacts/PROCESSING_JOB_NAME/autotrainer-output"
  },
  "hpoJob": {
    "name": "HPO_JOB_NAME",
    "arn": "arn:aws:sagemaker:us-east-1:123456789012:hyper-parameter-tuning-job/HPO_JOB_NAME",
    "status": "Completed"
  },
  "mlModels": [
    {
      "name": "MODEL_NAME-cpu",
      "arn": "arn:aws:sagemaker:us-east-1:123456789012:model/MODEL_NAME-cpu"
    }
  ],
  "id": "d584f5bc-d90e-4957-be01-523e07a7562e",
  "status": "Completed"
}
we can check the training logs and the artifacts in the destination S3 bucket.
REVIEWING MODEL TRAINING RESULTS
Model training has completed, so let’s check the results in the AWS console: SageMaker -> Training -> Training Jobs.
For simplicity, we didn’t change the number of HPO jobs when we started model training, and the default value of 2 was used. The 2 jobs were run in parallel. The instance type was selected automatically: ml.g4dn.2xlarge.
The first job (the one with ‘001’ in its name) completed in 15 minutes, and the second one (‘002’) was automatically stopped, as SageMaker supports early stopping if the training metrics do not improve for a while:
Let’s compare the hyperparameters that were used in these jobs:
Only 3 parameters have different values: num-hidden, num-negs and lr. The second model (trained by job ‘002’) had a higher learning rate while having less capacity to capture complex patterns (fewer hidden units), and it was trained with fewer negative samples. That led to significantly worse ranking quality, as we can see from the Validation Mean Rank (115 vs 23) and HITS@K:
Mean Rank (MR) is the average rank position of the correct link among the predictions. Lower MR values are better because they indicate that the correct link is, on average, ranked closer to the top.
The HITS@K metrics measure the proportion of times the correct link appears in the top K predicted results.
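For clarity, here is how all three metrics fall out of the rank of the correct link in each ranked prediction list (rank 1 = best); the ranks below are made-up example values:

import numpy as np

# Rank of the true link in each of 5 hypothetical prediction lists
ranks = np.array([1, 3, 2, 8, 50])

mr = ranks.mean()           # Mean Rank: lower is better
mrr = (1.0 / ranks).mean()  # Mean Reciprocal Rank: higher is better
# Hits@K: fraction of cases where the true link lands in the top K
hits = {k: float((ranks <= k).mean()) for k in (1, 3, 10)}
print(mr, mrr, hits)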
MODEL ARTIFACTS
When the training jobs are done, model artifacts are created in the output S3 bucket, along with files that contain training stats and metrics:
The metrics and parameters in these JSON files are the ones we mentioned earlier. Only the 001 directory contains the ‘output’ subdirectory with the model.tar.gz file, as it belongs to the only HPO job that ran to completion. Artifacts for link prediction also include the DGL graph data, as it’s required to make actual predictions, as explained here.
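To pull the winning model’s artifact down for inspection, a one-liner with boto3 is enough; the key below just mirrors the directory layout described above, and the bucket and job name are placeholders:

import boto3

# Download the artifact of HPO job 001 (the completed one)
s3 = boto3.client("s3")
s3.download_file(
    "OUTPUT_BUCKET",
    "model-artifacts/TRAINING_JOB_NAME/001/output/model.tar.gz",
    "model.tar.gz",
)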
These artifacts will be used to create an inference endpoint and generate actual link predictions. That will be discussed in the next and final post of this series.