Before AI Predicts Your Social Life, It Needs to Clean Its Data | HackerNoon

News Room | Published 12 February 2025

In this post, we’ll continue working on link prediction with the Twitch dataset.

We already have the graph data exported from Neptune using the neptune-export utility and the ‘neptune_ml’ profile. The previous steps are described in Parts 1 and 2 of this guide.

Read part 1 here and part 2 here.

The data is currently stored in S3 and looks like this:

Vertices CSV (nodes/user.consolidated.csv):

~id,~label,days,mature,views,partner
"6980","user",771,true,2935,false
"547","user",2602,true,18099,false
"2173","user",1973,false,3939,false
...

Edges CSV (edges/user-follows-user.consolidated.csv):

~id,~label,~from,~to,~fromLabels,~toLabels
"3","follows","6194","2507","user","user"
"19","follows","3","3739","user","user"
"35","follows","6","2126","user","user"
...

The export utility also generated this config file for us:

training-data-configuration.json:

{
  "version" : "v2.0",
  "query_engine" : "gremlin",
  "graph" : {
    "nodes" : [ {
      "file_name" : "nodes/user.consolidated.csv",
      "separator" : ",",
      "node" : [ "~id", "user" ],
      "features" : [ {
        "feature" : [ "days", "days", "numerical" ],
        "norm" : "min-max",
        "imputer" : "median"
      }, {
        "feature" : [ "mature", "mature", "auto" ]
      }, {
        "feature" : [ "views", "views", "numerical" ],
        "norm" : "min-max",
        "imputer" : "median"
      }, {
        "feature" : [ "partner", "partner", "auto" ]
      } ]
    } ],
    "edges" : [ {
      "file_name" : "edges/%28user%29-follows-%28user%29.consolidated.csv",
      "separator" : ",",
      "source" : [ "~from", "user" ],
      "relation" : [ "", "follows" ],
      "dest" : [ "~to", "user" ],
      "features" : [ ]
    } ]
  },
  "warnings" : [ ]
}

Our current goal is to perform data processing: converting the data we have into a format that the Deep Graph Library framework can use for model training. (For an overview of link prediction with just DGL, see this post.) That includes normalization of numerical features, encoding of categorical features, creating lists of node pairs with existing and non-existing links to enable supervised learning for our link prediction task, and splitting the data into training, validation, and test sets.
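
The node-pair step is worth a quick illustration: existing edges serve as positive examples, and randomly sampled non-existing edges as negatives. Here is a toy Python sketch of that sampling, using a few user IDs from the CSVs above (an illustration only, not what Neptune ML runs internally):

import random

# Positive examples: "follows" edges that exist in the graph.
positive_edges = {("6194", "2507"), ("3", "3739"), ("6", "2126")}
users = sorted({u for edge in positive_edges for u in edge})

# Negative examples: random user pairs that are not connected.
negative_edges = set()
while len(negative_edges) < len(positive_edges):
    u, v = random.sample(users, 2)
    if (u, v) not in positive_edges:
        negative_edges.add((u, v))

print(negative_edges)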

As you can see in the training-data-configuration.json file, the node features ‘days’ (account age) and ‘views’ were recognized as numerical, and min-max normalization was suggested. Min-max normalization scales arbitrary values to the range [0, 1]: x_normalized = (x - x_min) / (x_max - x_min). The imputer = median setting means that missing values will be filled with the column’s median.
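
As a rough pandas equivalent of those two settings (a sketch of the transformation, not Neptune ML’s actual code):

import pandas as pd

df = pd.read_csv("nodes/user.consolidated.csv")

for col in ("days", "views"):
    # imputer = "median": fill missing values with the column median.
    df[col] = df[col].fillna(df[col].median())
    # norm = "min-max": rescale values to the [0, 1] range.
    col_min, col_max = df[col].min(), df[col].max()
    df[col] = (df[col] - col_min) / (col_max - col_min)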

Node features ‘mature’ and ‘partner’ are labeled ‘auto’, and since those columns contain only boolean values, we expect them to be recognized as categorical features and encoded during the data processing stage. The train-validation-test split is not included in this automatically generated file; the default split for the link prediction task is 0.9/0.05/0.05.

You can adjust the normalization and encoding settings, and you can choose a custom train-validation-test split. If you choose to do so, just replace the original training-data-configuration.json file in S3 with the updated version. The full list of supported fields in that JSON is available here. In this post, we’ll leave this file unchanged.
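
For illustration, if you did want a custom split, a minimal boto3 sketch could edit the file in place. The bucket and key here are placeholders, and the split_rate field name should be verified against the field list linked above:

import json
import boto3

s3 = boto3.client("s3")
bucket = "SOURCE_BUCKET"
key = "neptune-export/training-data-configuration.json"

config = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

# Hypothetical tweak: a custom train/validation/test split on the edge type
# used for link prediction (verify "split_rate" against the config reference).
config["graph"]["edges"][0]["split_rate"] = [0.8, 0.1, 0.1]

s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(config, indent=2))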

IAM ROLES NEEDED FOR DATA PROCESSING

Just like in the data loading stage (described in Part 1 of this tutorial), we need to create IAM roles that allow access to the services we’ll be using, and we also need to add those roles to our Neptune cluster. We need two roles for the data processing stage. The first is a Neptune role that gives Neptune access to SageMaker and S3. The second is a SageMaker execution role that SageMaker uses while running the data processing task, and it allows access to S3.

These roles must have trust policies that allow Neptune and SageMaker services to assume them:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": "rds.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
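
If you prefer scripting over the console, here is a boto3 sketch of the role creation (the role names match the ARNs used later in this post; the permissions policies granting S3 and SageMaker access still need to be attached separately):

import json
import boto3

iam = boto3.client("iam")

# The trust policy shown above, as a Python dict.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        },
        {
            "Effect": "Allow",
            "Principal": {"Service": "rds.amazonaws.com"},
            "Action": "sts:AssumeRole",
        },
    ],
}

for role_name in (
    "NeptuneMLDataProcessingNeptuneRole",
    "NeptuneMLDataProcessingSagemakerRole",
):
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )
    # Attach the required permissions policies here with
    # iam.attach_role_policy(RoleName=role_name, PolicyArn=...).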

After creating the roles and updating their trust policies, we’ll add them to the Neptune cluster (Neptune -> Databases -> YOUR_NEPTUNE_CLUSTER_ID -> Connectivity & Security -> IAM Roles -> Add role).
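
The same console step can be scripted with the Neptune AddRoleToDBCluster API; a boto3 sketch (the cluster ID and account number are placeholders):

import boto3

neptune = boto3.client("neptune")

for role_arn in (
    "arn:aws:iam::123456789012:role/NeptuneMLDataProcessingNeptuneRole",
    "arn:aws:iam::123456789012:role/NeptuneMLDataProcessingSagemakerRole",
):
    neptune.add_role_to_db_cluster(
        DBClusterIdentifier="YOUR_NEPTUNE_CLUSTER_ID",
        RoleArn=role_arn,
    )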

DATA PROCESSING WITH NEPTUNE ML HTTP API

Now that the training-data-configuration.json file is in place and the IAM roles have been added to the Neptune cluster, we’re ready to start the data processing job. To do that, we need to send a request to the Neptune cluster’s HTTP API from inside the VPC where the cluster is located. We’ll use an EC2 instance to do that.

We’ll use curl to start the data processing job:

curl -XPOST https://YOUR_NEPTUNE_CLUSTER_ENDPOINT:8182/ml/dataprocessing \
  -H 'Content-Type: application/json' \
  -d '{
    "inputDataS3Location" : "s3://SOURCE_BUCKET/neptune-export/...",
    "processedDataS3Location" : "s3://OUTPUT_BUCKET/neptune-export-processed/...",
    "neptuneIamRoleArn": "arn:aws:iam::123456789012:role/NeptuneMLDataProcessingNeptuneRole",
    "sagemakerIamRoleArn": "arn:aws:iam::123456789012:role/NeptuneMLDataProcessingSagemakerRole"
  }'

Just these four parameters are required: the input data S3 location, the processed data S3 location, the Neptune IAM role, and the SageMaker IAM role. There are many optional parameters: for example, we can manually select the SageMaker instance type that will be provisioned for our data processing task with processingInstanceType and set its storage volume size with processingInstanceVolumeSizeInGB. The full list of parameters can be found here.
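
The same request can be sent from Python, with the two optional parameters added. This is a sketch using requests, run from inside the VPC like the curl call; the instance type and volume size are arbitrary example values, and if IAM database authentication is enabled on the cluster, the request must additionally be SigV4-signed:

import requests

response = requests.post(
    "https://YOUR_NEPTUNE_CLUSTER_ENDPOINT:8182/ml/dataprocessing",
    json={
        "inputDataS3Location": "s3://SOURCE_BUCKET/neptune-export/...",
        "processedDataS3Location": "s3://OUTPUT_BUCKET/neptune-export-processed/...",
        "neptuneIamRoleArn": "arn:aws:iam::123456789012:role/NeptuneMLDataProcessingNeptuneRole",
        "sagemakerIamRoleArn": "arn:aws:iam::123456789012:role/NeptuneMLDataProcessingSagemakerRole",
        # Optional parameters described above:
        "processingInstanceType": "ml.m5.2xlarge",
        "processingInstanceVolumeSizeInGB": 32,
    },
)
print(response.json())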

The cluster responds with a JSON that contains the ID of the data processing job that we just created:

{"id":"d584f5bc-d90e-4957-be01-523e07a7562e"}

We can use it to get the status of the job with this command (use the same neptuneIamRoleArn as in the previous request):

curl https://YOUR_NEPTUNE_CLUSTER_ENDPOINT:8182/ml/dataprocessing/YOUR_JOB_ID?neptuneIamRoleArn='arn:aws:iam::123456789012:role/NeptuneMLDataProcessingNeptuneRole'
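
For longer-running jobs it’s handier to poll the status from a script. A small Python sketch (it assumes the terminal status values are Completed and Failed; verify against the dataprocessing API reference):

import time
import requests

status_url = (
    "https://YOUR_NEPTUNE_CLUSTER_ENDPOINT:8182/ml/dataprocessing/YOUR_JOB_ID"
    "?neptuneIamRoleArn=arn:aws:iam::123456789012:role/NeptuneMLDataProcessingNeptuneRole"
)

while True:
    status = requests.get(status_url).json()["status"]
    print(status)
    if status in ("Completed", "Failed"):
        break
    time.sleep(60)  # data processing usually takes several minutes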

Once it responds with something like this,

{
  "processingJob": {...},
  "id":"d584f5bc-d90e-4957-be01-523e07a7562e",
  "status":"Completed"
}

we can check the output. The following files were created in the destination S3 bucket:

The graph.* files contain the processed graph data.

The features.json file contains the lists of node and edge features:

{
    "nodeProperties": {
        "user": [
            "days",
            "mature",
            "views",
            "partner"
        ]
    },
    "edgeProperties": {}
}

The details on how the data was processed and how the features were encoded can be found in the updated_training_config.json file:

{
    "graph": {
        "nodes": [
            {
                "file_name": "nodes/user.consolidated.csv",
                "separator": ",",
                "node": [
                    "~id",
                    "user"
                ],
                "features": [
                    {
                        "feature": [
                            "days",
                            "days",
                            "numerical"
                        ],
                        "norm": "min-max",
                        "imputer": "median"
                    },
                    {
                        "feature": [
                            "mature",
                            "mature",
                            "category"
                        ]
                    },
                    {
                        "feature": [
                            "views",
                            "views",
                            "numerical"
                        ],
                        "norm": "min-max",
                        "imputer": "median"
                    },
                    {
                        "feature": [
                            "partner",
                            "partner",
                            "category"
                        ]
                    }
                ]
            }
        ],
        "edges": [
            {
                "file_name": "edges/%28user%29-follows-%28user%29.consolidated.csv",
                "separator": ",",
                "source": [
                    "~from",
                    "user"
                ],
                "relation": [
                    "",
                    "follows"
                ],
                "dest": [
                    "~to",
                    "user"
                ]
            }
        ]
    }
}

We can see that the boolean columns ‘mature’ and ‘partner’, initially labeled ‘auto’ in the training-data-configuration.json file, were encoded as ‘category’ features.
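
For intuition, here is a pandas analogue of that ‘category’ encoding, using the sample values from the vertices CSV (Neptune ML’s internal representation may differ):

import pandas as pd

df = pd.DataFrame({
    "mature": [True, True, False],
    "partner": [False, False, False],
})

# One-hot encoding: each distinct value becomes its own indicator column.
print(pd.get_dummies(df, columns=["mature", "partner"]))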

The ‘train_instance_recommendation.json’ file contains the SageMaker instance type and storage size recommended for model training:

{
  "instance": "ml.g4dn.2xlarge",
  "cpu_instance": "ml.m5.2xlarge",
  "disk_size": 14126462,
  "mem_size": 4349122131.111111
}

The model-hpo-configuration.json file contains the type of the model, the metrics used for its evaluation, the frequency of the evaluation, and the hyperparameters.

This concludes the data processing stage: we are now ready to start training the ML model, which will be discussed in the next part of this guide.
