Authors:
(1) S M Rakib Hasan, Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh ([email protected]);
(2) Aakar Dhakal, Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh ([email protected]).
Table of Links
Abstract and I. Introduction
II. Literature Review
III. Methodology
IV. Results and Discussion
V. Conclusion and Future Work, and References
IV. RESULTS AND DISCUSSION
Our experiments show that the malware detection system performs very well in binary detection, while its multiclass performance depends on how the class imbalance is handled.
A. Binary Classification
Our trained model achieved 99.99% accuracy on the test set and detected all malware samples correctly. The result is shown in Fig. 2. All the models performed very well at detecting potential malware; the results are tabulated in TABLE II.
B. Malware Classification
As the dataset is highly imbalanced, we conducted this part in three steps. First, we ran the experiment on the original dataset; next, we undersampled the majority class; finally, we oversampled the minority classes.
1) Classification on Original Data: Here, we ran the untouched data through our chosen algorithms and achieved moderate results. Although the metrics are not as impressive as those for binary classification, it is worth noting that no malware sample was classified as safe; the errors came from malware being assigned to the wrong malware class. Our results are tabulated in TABLE III. They show that the XGBoost classifier performed best in the detection and classification of malware.
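The observation that no malware was classified as safe can be checked directly from the predictions. The following is a minimal sketch, not the authors' evaluation code. The label names and the toy label lists are assumptions made up for illustration and are not results from the paper.

```python
from collections import Counter

BENIGN = "Benign"  # assumed name of the safe class


def missed_malware(y_true, y_pred, benign=BENIGN):
    """Count malware samples that were predicted as benign."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != benign and p == benign)


def family_confusions(y_true, y_pred, benign=BENIGN):
    """Count malware samples assigned to the wrong malware class."""
    return Counter(
        (t, p)
        for t, p in zip(y_true, y_pred)
        if t != benign and p != benign and t != p
    )


# Toy labels, invented for illustration only (not results from the paper).
y_true = ["Benign", "Ransomware", "Spyware", "Trojan", "Spyware"]
y_pred = ["Benign", "Spyware", "Spyware", "Trojan", "Trojan"]

print(missed_malware(y_true, y_pred))     # malware predicted as safe
print(family_confusions(y_true, y_pred))  # malware confused with other malware
```

Separating the two error types makes the distinction in the text explicit: a missed detection is a security failure, whereas confusion between malware classes only affects the classification metrics.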
2) Undersampling Majority Class: We used four undersampling methods and trained our models on the output of each. The performance metrics varied across methods, and no single method dominated the scores. However, the Random Undersampling and Near Miss approaches performed better than the other two. These results are tabulated in TABLE IV. They show that the XGBoost classifier again performed best, with the Random Forest classifier very close behind. With this approach too, no malware was labeled safe during detection.
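As an illustration of the simplest of these methods, the sketch below implements random undersampling with the Python standard library only. It is not the authors' implementation; the function name, the toy data, and the choice to shrink every class to the size of the smallest class are assumptions. In practice, the imbalanced-learn library provides `RandomUnderSampler` and `NearMiss` for this purpose.

```python
import random
from collections import Counter, defaultdict


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original sample order
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data, invented for illustration: 6 benign rows versus 2 and 3 malware rows.
X = [[float(i)] for i in range(11)]
y = ["Benign"] * 6 + ["Spyware"] * 2 + ["Trojan"] * 3

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

Random undersampling discards data from the larger classes, which may explain why different undersampling methods trade off differently across metrics.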
3) Oversampling Minority Classes: Among the popular oversampling methods, we chose ADASYN (Adaptive Synthetic Sampling), a data augmentation technique primarily used in imbalanced classification tasks. We applied ADASYN to each minority class separately to balance the dataset and then applied our chosen classification algorithms. This approach gave our best results. The findings are tabulated in TABLE V.
Here too, XGBoost outperformed the other classifiers and provided the best predictions. The detection results are shown in Fig. 3.
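To make the idea behind ADASYN concrete, the following simplified sketch generates synthetic samples for one minority class using only the standard library. It is an illustration under stated assumptions, not the authors' code: the function name, the `n_new` and `k` parameters, and the toy data are hypothetical, and the sketch interpolates towards the `k` nearest same-class points. The imbalanced-learn library offers a full implementation as `ADASYN`.

```python
import math
import random


def adasyn_sketch(X, y, minority, n_new, k=2, seed=0):
    """Create about n_new synthetic samples for one class, ADASYN-style."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    if n_new <= 0 or len(minority_idx) < 2:
        return []

    def nearest(i, candidates):
        # k nearest candidates to sample i by Euclidean distance, excluding i
        others = [j for j in candidates if j != i]
        return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

    # r_i: share of other-class points among the k nearest neighbours of x_i
    ratios = []
    for i in minority_idx:
        neigh = nearest(i, range(len(X)))
        ratios.append(sum(y[j] != minority for j in neigh) / len(neigh))
    total = sum(ratios)
    # If no minority point has other-class neighbours, weight all points equally.
    if total > 0:
        weights = [r / total for r in ratios]
    else:
        weights = [1 / len(ratios)] * len(ratios)

    synthetic = []
    for i, w in zip(minority_idx, weights):
        same_class = nearest(i, minority_idx)
        for _ in range(round(w * n_new)):
            j = rng.choice(same_class)
            lam = rng.random()  # interpolation factor in [0, 1)
            synthetic.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


# Toy data, invented for illustration: 5 majority rows and 3 minority rows.
X = [[1.0, 1.0], [1.2, 1.0], [1.0, 1.2], [1.2, 1.2], [0.8, 0.8],
     [0.0, 0.0], [0.1, 0.1], [0.6, 0.6]]
y = ["Benign"] * 5 + ["Spyware"] * 3

new_rows = adasyn_sketch(X, y, minority="Spyware", n_new=2)
print(new_rows)
```

In this toy example only the minority point at (0.6, 0.6) has majority-class neighbours, so all synthetic samples are generated around it. This is the adaptive part of ADASYN: more synthetic data is created for minority samples that lie close to other classes and are therefore harder to learn.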
Overall, our malware detection models perform well and are robust. In binary classification, they detected every malware sample in the test set through memory dump analysis. For classifying malware by type, ADASYN emerged as the most promising of the approaches we explored. By systematically addressing the class imbalance through synthetic data generation, we achieved better results than with either the original data or the undersampling techniques. These outcomes underscore the importance of tailored strategies for handling class imbalance and reaffirm the potential of advanced techniques like ADASYN in enhancing multiclass classification accuracy.
V. CONCLUSION AND FUTURE WORK
In conclusion, our research addresses the rising threat of obfuscated malware in connected devices and the internet landscape. Through memory dump analysis and diverse machine learning algorithms, we have explored effective detection strategies and examined their strengths and limitations using the CIC-MalMem-2022 dataset. Emphasizing the synergy between machine learning and traditional security methods, our work underscores the need for a comprehensive defense strategy in the dynamic cybersecurity realm. While acknowledging the ever-evolving malware landscape, our research lays the groundwork for future endeavours and advocates continuous adaptation. Future efforts should focus on refining algorithms, exploring new data sources, and fostering interdisciplinary collaboration. We envision research on hybrid approaches that combine machine learning with signature-based methods, as well as studies of adversarial attacks and explainable AI to enhance the robustness and transparency of detection systems. In summary, our study provides valuable insights for resilient cybersecurity solutions, addressing the challenges of obfuscated malware and advancing detection capabilities to safeguard digital ecosystems against emerging threats.