Authors:
(1) Constantinos Patsakis, Department of Informatics, University of Piraeus, 80 Karaoli & Dimitriou str., 18534 Piraeus, Greece and Information Management Systems Institute of Athena Research Centre, Greece;
(2) Fran Casino, Information Management Systems Institute of Athena Research Centre, Greece and Department of Computer Engineering and Mathematics, Universitat Rovira i Virgili;
(3) Nikolaos Lykousas, Data Centric, Romania.
Table of Links
Abstract and 1 Introduction
2 Related work
2.1 Malware analysis and countermeasures
2.2 LLMs in cybersecurity
3 Problem setting
4 Setting up the experiment and the dataset
5 Experimental results and discussion
6 Integration with existing pipelines
7 Conclusions, Acknowledgements, and References
Abstract
The integration of large language models (LLMs) into various pipelines is increasingly widespread, effectively automating many manual tasks and often surpassing human capabilities. Cybersecurity researchers and practitioners have recognised this potential and are actively exploring its applications, given the vast volume of heterogeneous data that requires processing to identify anomalies, potential bypasses, attacks, and fraudulent incidents. On top of this, LLMs’ advanced capabilities in generating functional code, comprehending code context, and summarising its operations can also be leveraged for reverse engineering and malware deobfuscation. To this end, we delve into the deobfuscation capabilities of state-of-the-art LLMs. Beyond merely discussing a hypothetical scenario, we evaluate four LLMs with real-world malicious scripts used in the notorious Emotet malware campaign. Our results indicate that while not yet perfectly accurate, some LLMs can efficiently deobfuscate such payloads. Thus, fine-tuning LLMs for this task is a viable direction for future AI-powered threat intelligence pipelines in the fight against obfuscated malware.
1 Introduction
While artificial intelligence and machine learning have long been cornerstones of computer science, it is only in recent years that we have fully harnessed their capabilities and translated them into practical applications, realising their true potential. This remarkable transformation cannot be solely attributed to the field’s maturity but rather to a convergence of several enabling factors. One pivotal factor is the exponential growth of data generation, which has yielded vast datasets indispensable for training more sophisticated models. This abundance of data enables AI systems to learn and improve from a broader and more diverse range of examples, leading to improved performance and generalisability. Simultaneously, significant advancements in computational power, particularly through the advent of GPUs and specialised hardware like Tensor Processing Units (TPUs), have dramatically reduced the time and resources needed to process complex algorithms and large datasets. This technological leap has made AI research more feasible and widespread, allowing researchers to experiment with more ambitious models and apply them to a broader spectrum of problems. Furthermore, the maturation and accessibility of machine learning frameworks and libraries have democratised AI, empowering a wider community of researchers, developers, and businesses to innovate and apply these transformative technologies across many domains. As a result, this technological democratisation has accelerated the adoption of AI, fostering a virtuous cycle of innovation, application, and investment, driving the rapid growth and evolution of machine learning and AI technologies.
In the last decade, machines have proved their capabilities compared with humans (e.g., by achieving human parity in several contexts [8, 22], and by being capable of processing vast amounts of data). Undeniably, the recent advent of Large Language Models (LLMs) such as GPT [27], BERT [7], and others has marked a major milestone in AI and machine learning, fundamentally transforming how machines process, comprehend, and generate human language. LLMs have elevated natural language processing (NLP) capabilities, enabling more refined and context-aware text interpretation and producing coherent and contextually relevant responses. This advancement has broadened the applicability of AI across a spectrum of domains, encompassing translation, content creation, conversational agents, and more, paving the way for more natural and efficient human-computer interactions. Furthermore, LLMs have fuelled research that delves deeper into the intricacies of language representation and generation, pushing the boundaries of what is possible in machine comprehension and creativity. The scalability of these models allows them to continuously learn from an extensive array of sources, constantly improving and adapting to new information and contexts. This has not only led to the development of more powerful and versatile AI systems but has also ignited discussions about ethical considerations, potential biases, and the future impact of AI on society. In essence, the introduction of LLMs represents a transformative leap in the AI and ML landscape, significantly expanding both the potential applications and the social implications of these technologies.
Given their advanced capabilities in generating code, understanding the underlying functionality, and summarising it, it is natural to ask whether LLMs can be used for malware analysis, and in particular for reverse engineering malware. The reason is that modern malware is mostly packed to evade antivirus detection, bypass static signatures, and hinder its analysis by hiding the malicious payload [28, 31, 23]. Additionally, using obfuscators introduces several challenges to automated static analysis. For instance, variables in the obfuscated code are randomly named, and the added dead code contains random variables and strings. Moreover, malware authors may use domain generation algorithms (DGAs) to resolve the command and control server [4]. Therefore, a generic automated deobfuscator must understand the code to prune the added noise and distinguish whether a random-looking string is, e.g., a domain or just a distraction.
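To illustrate why purely static tooling struggles here, consider the following toy Python analogue of the string-splitting trick commonly found in obfuscated droppers; the URL, fragments, and variable names below are placeholders of our own, not taken from any real sample:

```python
# Toy illustration: the payload URL never appears as a literal, so signature
# matching on strings fails, and the dead "junk" data is indistinguishable
# from the real fragments without evaluating or understanding the code.
parts = ["ht", "tp", "://", "exa", "mple", ".com", "/load.php"]  # real fragments
junk = ["kz9Qw", "http://", "never-used.invalid"]                # dead-code noise
url = "".join(parts)  # the full URL only materialises at run time
print(url)            # -> http://example.com/load.php
```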
Malware analysts must combat these and other anti-analysis measures by reverse engineering malware. During this process, the analyst must convert the code from a binary into a human-readable form or extract the executable code from, e.g., a malicious document and analyse it. In both cases, the code is typically obfuscated. There are plugins for mainstream tools such as IDA and Ghidra that communicate with GPT to explain functions and reduce the manual effort of their analysis. From this, malware analysts can create static and dynamic rules to detect the strains of the given sample and extract indicators of compromise (IOCs). The analysis may also require further dynamic steps, such as debugging or execution in a sandbox. These IOCs are essential for threat hunting and can facilitate takedowns and attribution. Nevertheless, the rules and extraction methods are heavily dependent on the sample. Note that although some generic deobfuscators can considerably simplify this task [32] by removing useless code, renaming variables, and reordering code, automated extraction remains challenging because they neither evaluate nor understand code. As a result, when malware authors change their codebase and tools during a malware campaign, they render detection rules (e.g., Yara rules[1]), unpackers, extraction mechanisms, and CyberChef[2] recipes useless.
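To make this rigidity concrete, the sketch below shows a simplified, hypothetical regex-based IOC extractor of the kind such pipelines rely on; it is our own illustration, not an actual Emotet extractor, and a small change in how the author quotes, encodes, or concatenates the URLs is enough to make it return nothing:

```python
import re

# Hypothetical extractor: pull URL-like IOCs out of an already deobfuscated
# script. A pattern this specific breaks as soon as the malware author
# switches quoting, encoding, or concatenation style.
URL_RE = re.compile(r"https?://[\w.-]+(?:/[\w./-]*)?")

def extract_iocs(script_text: str) -> list[str]:
    """Return candidate C2 URLs found verbatim in the script text."""
    return URL_RE.findall(script_text)

deobfuscated = 'Invoke-WebRequest "http://example.com/load.php" -OutFile payload.exe'
print(extract_iocs(deobfuscated))  # ['http://example.com/load.php']
```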
It is evident that the above requires a lot of manual effort, so every possible automation would be beneficial for the cybersecurity community. Nevertheless, this is an arms race between malware authors and defenders, where the former try to bypass the established security mechanisms. Given the prevalence of obfuscation in modern malware and the speed with which malware authors change their code and toolkits to adapt to new detection rules, robust and adaptive deobfuscation methods are imperative. Current approaches, in which deobfuscation methods are tailored to specific obfuscators and versions, clearly leave defenders one step behind. In contrast, the versatility of LLMs, their performance in code-related tasks, and their ease of integration into pipelines make them an excellent candidate for generic code deobfuscation. They can provide a versatile solution to address these shortcomings and facilitate the extraction of actionable threat intelligence from malware and other malicious scripts, e.g., webshells. LLMs could reinforce existing pipelines and extract the necessary intelligence, e.g., command and control servers from malware configurations, when existing methods fail because they are too specific and rigid.

Contribution: We explore the capacity of state-of-the-art LLMs in a realistic, well-defined and focused task related to malware analysis: deobfuscating malicious PowerShell scripts. While more constrained than deobfuscating a binary, this task is better suited to the maturity and input size of modern LLMs, easier to scale, and allows a fair and transparent comparison against ground truth. The latter also provides the means to extract important insights about LLMs’ current state, applicability, and efficacy in such tasks, fostering advancement in the field. Thus, our work explores in a systematic and practical way how LLMs can be used to automate cyber threat intelligence pipelines, focusing on intelligence extracted from malware samples. Moreover, the deobfuscation is performed on data from a real-world malware campaign, namely Emotet, which, at the time, Europol characterised as “the most dangerous malware” [9]. Our results show that, even without task-specific training, LLMs perform very promisingly, so we expect their use for such tasks to broaden considerably in the coming years. To the best of our knowledge, this is the first work to use LLMs on a broad, real-world dataset to deobfuscate malicious scripts and to showcase how they can be used in cyber threat intelligence pipelines.
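As a minimal sketch of how such a step could slot into a pipeline, the snippet below sends an obfuscated script to an OpenAI-compatible chat-completion endpoint and asks for the cleaned code; the model name, prompt, and helper function are illustrative assumptions, not the exact configuration evaluated in this paper:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK (v1.x) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def deobfuscate(script: str, model: str = "gpt-4") -> str:
    """Ask an LLM to deobfuscate a PowerShell script and return only clean code.

    Illustrative placeholder: the prompt and model are not the exact
    configuration used in our experiments.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a malware analyst. Deobfuscate the given "
                        "PowerShell script and return only the cleaned code."},
            {"role": "user", "content": script},
        ],
    )
    return response.choices[0].message.content
```

The returned script can then be handed to existing rule-based extractors to recover IOCs that the original obfuscated sample would otherwise hide.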
Road map: The rest of this work is structured as follows. In Section 2, we present the related work, and in Section 3, we define the problem setting. In Section 4, we detail our experimental setup and our reference dataset. Next, in Section 5, we present the results of our experiments. Based on the above, in Section 6, we propose an augmented cyber threat intelligence pipeline for deobfuscating malicious payloads. Finally, the article concludes by summarising our findings and discussing ideas for future work.
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED (Attribution-Noncommercial-Sharealike 4.0 International) license.
[1] https://virustotal.github.io/yara/
[2] https://gchq.github.io/CyberChef/