Dumping Data & Dodging Danger: A Quirky Quest Against Obfuscated Malware | HackerNoon


Authors:

(1) S M Rakib Hasan, Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh ([email protected]);

(2) Aakar Dhakal, Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh ([email protected]).

Table of Links

Abstract and I. Introduction

II. Literature Review

III. Methodology

IV. Results and Discussion

V. Conclusion and Future Work, and References

III. METHODOLOGY

A. Dataset Description

The obfuscated malware dataset is a collection of memory dumps from benign and malicious processes, created to evaluate the performance of obfuscated malware detection methods. The dataset contains 58,596 records, split evenly between benign (50%) and malicious (50%) samples. The malicious memory dumps come from three categories of malware: Spyware, Ransomware, and Trojan Horse. The dataset is based on the work of [10], who proposed a memory feature engineering approach for detecting obfuscated malware.

The dataset has 55 features and two label columns: one denoting whether a sample is malicious and the other denoting its malware family. Of the malicious samples, 32.5% are Trojan Horse, 33.67% are Spyware, and 33.8% are Ransomware.
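To make the setup concrete, here is a minimal loading sketch. The file name and the column names (`Class` for the binary label, `Category` for the family label) are assumptions about the CSV layout, not details from the paper.

```python
import pandas as pd

# Hypothetical file name; adjust to wherever the memory-dump CSV lives.
df = pd.read_csv("Obfuscated-MalMem2022.csv")

print(df.shape)                                      # expect 58,596 rows
print(df["Class"].value_counts(normalize=True))      # ~50% benign, ~50% malicious
print(df["Category"].value_counts(normalize=True))   # Spyware / Ransomware / Trojan shares
```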

After collecting the data, we went through general preprocessing steps and conducted two sets of classification: one to detect malware and the other to classify it by family. Afterwards, we compiled the results for comparison. Fig. 1 demonstrates our workflow.

Fig. 1. Our work sequence.

B. Data Preprocessing

First, we cleaned the data: some labels carried unnecessary information, so we trimmed them. We also standardized the data points so that all features are treated equally and the models converge better. The dataset was well maintained and clean, so we did not have to perform denoising or further preprocessing. For some classifications, we also had to encode the labels or oversample or undersample the data points.
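A sketch of these preprocessing steps, continuing from the loading snippet above. The assumption that the family name is the first dash-separated token of the `Category` string is ours; the paper only says the labels were trimmed.

```python
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Trim labels: keep only the family token (assumed format "Family-Variant-...").
df["Category"] = df["Category"].str.split("-").str[0]

feature_cols = [c for c in df.columns if c not in ("Class", "Category")]
# Standardize features to zero mean and unit variance; in practice, fit the
# scaler on the training split only to avoid leakage.
X = StandardScaler().fit_transform(df[feature_cols])

y_binary = LabelEncoder().fit_transform(df["Class"])     # benign/malicious -> 0/1
y_family = LabelEncoder().fit_transform(df["Category"])  # encoded family labels
```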

C. Model Description

Here, we provide an in-depth analysis of the machine learning models used, discussing their underlying principles, loss functions, activation functions, mathematical formulations, strengths, weaknesses, and limitations.

1) Random Forest: The Random Forest classifier is a powerful ensemble learning method that aims to improve predictive performance and reduce overfitting by aggregating multiple decision trees [11]. Each decision tree is built on a bootstrapped subset of the data, and features are sampled randomly to construct a diverse ensemble of trees. The final prediction is determined through a majority vote or weighted average of individual tree predictions.

Random Forest employs the Gini impurity as the default loss function for measuring the quality of splits in decision trees. It is defined as:

$$\mathrm{Gini} = 1 - \sum_{i=1}^{c} p_i^2$$

where $c$ is the number of classes and $p_i$ is the probability of a sample being classified as class $i$.
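As a quick illustration of the formula (not from the paper), a node's impurity can be computed directly from its class proportions:

```python
import numpy as np

def gini(probs):
    """Gini impurity: 1 - sum_i p_i^2 over the c class probabilities."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - np.sum(probs ** 2)

print(gini([0.5, 0.5]))  # 0.5: a maximally impure two-class node
print(gini([1.0, 0.0]))  # 0.0: a pure node
```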

Random Forest is robust against outliers and noisy data due to the ensemble nature that averages out individual tree errors. Also, it is less prone to overfitting compared to individual decision trees. Additionally, Random Forest provides an estimate of feature importance, aiding in feature selection and interpretation of results.
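A minimal Random Forest baseline in scikit-learn, continuing from the preprocessing sketch; the hyperparameters are illustrative defaults rather than the paper's settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stratified 80/20 split, matching the 80% training share used in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, stratify=y_binary, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print(rf.score(X_test, y_test))     # test accuracy
print(rf.feature_importances_[:5])  # built-in feature-importance estimates
```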

2) Multi-Layer Perceptron (MLP) Classifier: The Multi-Layer Perceptron (MLP) classifier is a type of artificial neural network consisting of multiple layers of interconnected nodes, or neurons [12], that process information and pass it to the subsequent layer through activation functions. The architecture typically includes an input layer, one or more hidden layers, and an output layer.

MLPs commonly use the cross-entropy loss function for classification tasks, which measures the dissimilarity between predicted probabilities and true class labels. Mathematically, it is defined as:

$$H(p, q) = -\sum_{i} p(i) \log q(i)$$

where $p(i)$ is the true probability distribution of class $i$ and $q(i)$ is the predicted probability distribution.
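A direct translation of the formula (illustrative only): a confident correct prediction yields a low loss, a confident wrong one a high loss.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p(i) * log q(i); q is clipped for numerical safety."""
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return -np.sum(np.asarray(p) * np.log(q))

print(cross_entropy([1, 0, 0], [0.9, 0.05, 0.05]))  # ~0.105
print(cross_entropy([1, 0, 0], [0.1, 0.45, 0.45]))  # ~2.303
```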

MLPs can capture complex patterns in data; the architecture's flexibility allows them to model non-linear relationships effectively, and with sufficient data and training they can achieve high predictive accuracy. However, they require careful hyperparameter tuning to prevent overfitting, and their performance depends heavily on the choice of architecture and initialization.
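A minimal MLP sketch; the paper does not state its architecture here, so the layer sizes below are assumptions.

```python
from sklearn.neural_network import MLPClassifier

# Two assumed hidden layers; scikit-learn minimizes cross-entropy internally.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                    max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
```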

3) k-Nearest Neighbors (KNN) Classifier: The k-Nearest Neighbors (KNN) classifier is a non-parametric algorithm used for both classification and regression tasks [13]. It makes predictions based on the majority class of the k nearest data points in the feature space.

KNN is not driven by an explicit loss function or optimization during training. Instead, during inference, it calculates distances between the query data point and all training data points using a distance metric such as Euclidean or Manhattan distance, and selects the k nearest neighbors based on these distances. KNN can handle multi-class problems and adapts well to changes in data distribution; because it does not assume any underlying data distribution, it is suitable for various data types. One limitation of KNN is its sensitivity to the choice of k and of the distance metric.
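A k-NN sketch; k = 5 and Euclidean distance are illustrative choices, and the standardization performed earlier keeps distances comparable across features.

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)  # "training" only stores the data points
print(knn.score(X_test, y_test))
```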

4) XGBoost Classifier: The XGBoost (Extreme Gradient Boosting) classifier is an ensemble learning algorithm that has gained popularity for its performance and scalability in structured data problems [14]. It combines the power of gradient boosting with regularization techniques to create a robust and accurate model. It optimizes a loss function through a series of decision trees. The loss function is often a sum of a data-specific loss (e.g., squared error for regression or log loss for classification) and a regularization term. It minimizes this loss by iteratively adding decision trees and adjusting their weights. XGBoost excels in handling structured data, producing accurate predictions with fast training times. Its ensemble of shallow decision trees helps capture complex relationships in the data. Regularization techniques, such as L1 and L2 regularization, prevent overfitting and enhance generalization. However, it may not perform as well on text or image data, as it is primarily designed for structured data.
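An XGBoost sketch along the same lines; the parameters are common starting points, not the paper's configuration.

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
                    reg_lambda=1.0,           # L2 regularization term
                    eval_metric="logloss")    # log loss for classification
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))
```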

D. Binary Classification

In this step, our primary target was to detect malicious memory dumps. For this purpose, we used some simple yet robust machine learning classifiers, namely the Random Forest classifier, the Multi-Layer Perceptron, and KNN, and obtained outstanding results: our trained models detected all the malware accurately and performed very well at separating malicious from benign memory dumps.
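A sketch of this detection step, reusing the models fitted above; the choice of `classification_report` as the metric summary is ours.

```python
from sklearn.metrics import classification_report

for name, model in [("Random Forest", rf), ("MLP", mlp), ("KNN", knn)]:
    preds = model.predict(X_test)
    print(name)
    print(classification_report(y_test, preds, digits=4))
```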

E. Malware Classification

Here, we tried to separate the malware families from each other based on their features. With the benign samples included, the data was highly imbalanced, so we had to go through some further preprocessing before training the detection system. We conducted the classification in three steps:

• Classifying in the original format: We ran the data through the different ML models and stored the metric scores for comparison, applying no steps beyond our usual preprocessing and passing the same data as in the binary classification.

• Undersampling the majority class: As "Benign" was the majority class in the malware dataset, we undersampled it through various methods, namely the Edited Nearest Neighbor Rule [15], the Near Miss Rule [16], Random Undersampling, and All-KNN Undersampling.

• Oversampling the minority classes: We used the ADASYN [17] method to generate synthetic data points for all minority classes (a resampling sketch follows this list).
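A sketch of these resampling setups using imbalanced-learn; the method names mirror the list above, while all parameters are library defaults.

```python
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import (AllKNN, EditedNearestNeighbours,
                                     NearMiss, RandomUnderSampler)
from sklearn.model_selection import train_test_split

# Split on the four-way labels (benign + three families) from earlier.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_family, test_size=0.2, stratify=y_family, random_state=42)

resamplers = {
    "original": None,
    "ENN": EditedNearestNeighbours(),
    "NearMiss": NearMiss(),
    "RandomUnder": RandomUnderSampler(random_state=42),
    "AllKNN": AllKNN(),
    "ADASYN": ADASYN(random_state=42),
}

datasets = {}
for name, sampler in resamplers.items():
    datasets[name] = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
```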

After applying these techniques, the resulting training-set sizes (80% of the total data) are shown in TABLE I. Afterwards, we used different machine learning algorithms to train our detection models and recorded their performance.

TABLE I. TRAINING SIZE OF THE ORIGINAL, UNDERSAMPLED, AND OVERSAMPLED DATASETS.
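Finally, a sketch of the comparison loop: each resampled training set from above is used to fit a model, which is then scored on the untouched test split. Macro-averaged F1 is our illustrative metric choice, and Random Forest stands in for the full set of algorithms.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

for data_name, (Xr, yr) in datasets.items():
    model = RandomForestClassifier(random_state=42).fit(Xr, yr)
    preds = model.predict(X_te)
    print(f"{data_name}: macro-F1 = {f1_score(y_te, preds, average='macro'):.4f}")
```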
