Table of Links
Abstract and I. Introduction
II. Related Work
III. Technical Background
IV. Systematic Security Vulnerability Discovery of Code Generation Models
V. Experiments
VI. Discussion
VII. Conclusion, Acknowledgments, and References
Appendix
A. Details of Code Language Models
B. Finding Security Vulnerabilities in GitHub Copilot
C. Other Baselines Using ChatGPT
D. Effect of Different Number of Few-shot Examples
E. Effectiveness in Generating Specific Vulnerabilities for C Codes
F. Security Vulnerability Results after Fuzzy Code Deduplication
G. Detailed Results of Transferability of the Generated Nonsecure Prompts
H. Details of Generating the Non-secure Prompts Dataset
I. Detailed Results of Evaluating CodeLMs using Non-secure Dataset
J. Effect of Sampling Temperature
K. Effectiveness of the Model Inversion Scheme in Reconstructing the Vulnerable Codes
L. Qualitative Examples Generated by CodeGen and ChatGPT
M. Qualitative Examples Generated by GitHub Copilot
II. RELATED WORK
In the following, we briefly introduce existing work on large language models and discuss how it relates to our approach.
A. Large Language Models and Prompting
Large language models have advanced the field of natural language processing across various tasks, including question answering, translation, and reading comprehension [1], [17]. These milestones were achieved by scaling model size from hundreds of millions [18] to hundreds of billions [1] of parameters, by self-supervised objective functions and reinforcement learning from human feedback [19], and by training on huge corpora of text data. Many of these models are trained by large companies and then released as pre-trained models. Brown et al. [1] show that such models can tackle a variety of tasks when given only a few examples as input, without any changes to the models' parameters. The end user can provide a template as a few-shot prompt to guide the model to generate the desired output for a specific task. In this work, we show how a few-shot prompting approach can be used to generate code with specific vulnerabilities by approximating the inversion of black-box code generation models.
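To make the few-shot prompting idea concrete, the following minimal Python sketch builds a prompt from a handful of input-output demonstrations; the query_model function is a hypothetical placeholder for whatever black-box completion endpoint is being queried, and the demonstrations are invented for illustration only.

# Minimal few-shot prompting sketch (illustrative only).
FEW_SHOT_EXAMPLES = [
    ("Translate to French: cheese", "fromage"),
    ("Translate to French: bread", "pain"),
]

def build_prompt(examples, query):
    # Concatenate the demonstrations and the new query into a single prompt.
    demos = "\n".join(f"{x}\n{y}" for x, y in examples)
    return f"{demos}\nTranslate to French: {query}\n"

def query_model(prompt):
    # Hypothetical call to a black-box language model; no parameters are updated.
    raise NotImplementedError

prompt = build_prompt(FEW_SHOT_EXAMPLES, "apple")
# completion = query_model(prompt)  # expected to follow the demonstrated pattern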
B. Large Language Models of Source Codes
There is growing interest in using large language models for source code understanding and generation tasks [7], [5], [20]. Feng et al. [21] and Guo et al. [22] propose encoder-only models with different variants of the objective function. These models [21], [22] primarily focus on code classification, code retrieval, and program repair. Ahmad et al. [23] and Wang et al. [20] employ encoder-decoder architectures to tackle code-to-code and code-to-text generation tasks, including program translation, program repair, and code summarization. Recently, decoder-only models have shown promising results in generating programs in a left-to-right fashion [5], [4], [6], [12]. These models can be applied to zero-shot and few-shot program generation tasks [5], [6], [24], [12], including code completion, code infilling, and text-to-code tasks. Large language models of code have mainly been evaluated based on the functional correctness of the generated code, without considering potential security vulnerabilities (see Section II-C for a discussion). In this work, we propose an approach to automatically find specific security vulnerabilities that these models can generate by approximating the inversion of the target black-box models via few-shot prompting.
C. Security Vulnerability Issues of Code Generation Models
Large language models for code generation are pre-trained on vast corpora of open-source code [7], [5], [25]. This open-source code can contain a variety of security vulnerabilities, including memory safety violations [26], deprecated APIs and algorithms (e.g., the MD5 hash algorithm [27], [15]), SQL injection, and cross-site scripting [28], [15]. Large language models can learn these insecure patterns and potentially generate vulnerable code given the users' inputs. Recently, Pearce et al. [15] and Siddiq and Santos [28] showed that code produced by code generation models can contain various security issues.
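To illustrate the kinds of insecure patterns mentioned above, the short Python snippet below combines a deprecated hash algorithm (MD5, CWE-327) with an SQL query built via string formatting (CWE-089); the function and table names are invented for this sketch and are not taken from any model output.

import hashlib
import sqlite3

def store_password(conn: sqlite3.Connection, user: str, password: str) -> None:
    # Insecure: MD5 is a broken, deprecated hash algorithm for passwords (CWE-327).
    digest = hashlib.md5(password.encode()).hexdigest()
    # Insecure: building SQL via string formatting enables injection (CWE-089).
    conn.execute(f"INSERT INTO users VALUES ('{user}', '{digest}')")

# A safer variant would use a dedicated password hashing scheme and a
# parameterized query, e.g. conn.execute("INSERT INTO users VALUES (?, ?)", ...)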
Pearce et al. [15] use a set of manually designed scenarios to investigate potential security vulnerabilities in GitHub Copilot [9]. These scenarios are curated from a limited set of vulnerable code examples. Each scenario contains the first few lines of a potentially vulnerable program, and the models are queried to complete the scenario. The scenarios were designed based on MITRE's Common Weakness Enumeration (CWE) [29]. Pearce et al. [15] evaluate the vulnerabilities of the generated code using GitHub's CodeQL static analysis tool. Previous studies [15], [30], [28] examined security issues in code generation models, but they relied on a limited set of manually designed scenarios, which risks missing code with certain vulnerability types that the models could generate. In contrast, our work proposes a systematic approach that finds security vulnerabilities by automatically generating diverse scenarios at scale. This enables us to create a diverse set of non-secure prompts for assessing and comparing the models with respect to generating code with security issues.
D. Model Inversion and Data Extraction
Deep model inversion has been applied to model explanation [31], model distillation [32], and, more commonly, to reconstructing private training data [33], [34], [35], [36]. The general goal of model inversion is to reconstruct a representative view of the input data based on the model's outputs [34]. Recently, Carlini et al. [37] showed that it is possible to extract memorized data from large language models, including personal information such as e-mail addresses, URLs, and phone numbers. In this work, we use few-shot prompting to approximate an inversion of the targeted black-box code models. Our goal is to employ this approximated inversion to automatically find the scenarios (prompts) that lead the models to generate code with a specific type of vulnerability.
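As a simplified, hedged sketch of what approximating inversion via few-shot prompting can look like (this is not the authors' exact prompt format), the demonstrations below pair vulnerable code with a prompt that could plausibly have produced it, and the model is then asked to propose such a prompt for a new vulnerable example; query_model is again a hypothetical black-box call.

# Sketch: inversion by few-shot prompting (illustrative only).
INVERSION_DEMOS = [
    {
        "code": "digest = hashlib.md5(password.encode()).hexdigest()",
        "prompt": "# hash the user's password before storing it\ndigest =",
    },
]

def build_inversion_prompt(demos, vulnerable_code):
    # Demonstrations map vulnerable code back to a plausible prompt.
    parts = [f"Code:\n{d['code']}\nPrompt:\n{d['prompt']}\n" for d in demos]
    parts.append(f"Code:\n{vulnerable_code}\nPrompt:\n")
    return "\n".join(parts)

def query_model(prompt):
    raise NotImplementedError  # placeholder for the targeted black-box model

# candidate = query_model(build_inversion_prompt(INVERSION_DEMOS, new_vulnerable_code))
# Feeding the candidate prompt back into the code model tests whether it
# reproduces the targeted vulnerability type.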
III. TECHNICAL BACKGROUND
Detecting software bugs before deployment can prevent potential harm and unforeseeable costs. However, automatically finding security-critical bugs in code is a challenging task in practice. This is especially true for model-generated code, given the black-box nature and complexity of such models. In the following, we elaborate on recent analysis methods and classification schemes for code vulnerabilities.
A. Evaluating Security Issues
Various security testing methods can be used to find software vulnerabilities before they cause bugs at run time in a deployed system [38], [39], [40]. To achieve this goal, these methods attempt to detect different kinds of programming errors, poor coding style, deprecated functionality, or potential memory safety violations (e.g., unauthorized access to unsafe memory that can be exploited after deployment, or obsolete cryptographic schemes that are insecure [41], [42], [26]). Broadly speaking, current methods for the security evaluation of software can be divided into two categories: static analysis [38], [43] and dynamic analysis [44], [45]. While static analysis examines the code of a given program without running it to find potential vulnerabilities, dynamic analysis executes the code. For example, fuzz testing (fuzzing) runs the program on randomly generated inputs to trigger bugs.
For the purpose of our work, we use static analysis to evaluate the generated code, as it enables us to classify the type of detected vulnerabilities. Specifically, we use CodeQL, one of the best-performing free static analysis engines, released by GitHub [46]. To analyze the code generated by the language models, we query it via CodeQL to find security vulnerabilities. We use CodeQL's CWE classification output to categorize the type of vulnerability found during our evaluation and to define the set of vulnerabilities that we investigate further throughout this work.
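As an illustration of how such an evaluation pipeline might call CodeQL from Python, the sketch below builds a database over a directory of generated samples and runs a security query suite; the directory name, query-suite name, and output format are assumptions that depend on the local CodeQL installation and installed query packs.

import subprocess

# Build a CodeQL database over the generated Python samples.
subprocess.run(
    ["codeql", "database", "create", "generated-db",
     "--language=python", "--source-root=generated_code"],
    check=True,
)

# Run a security-oriented query suite; findings are tagged with CWE IDs in the
# resulting SARIF file.
subprocess.run(
    ["codeql", "database", "analyze", "generated-db",
     "python-security-and-quality.qls",
     "--format=sarif-latest", "--output=results.sarif"],
    check=True,
)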
B. Classification of Security Weaknesses
The Common Weakness Enumeration (CWE) is a list of typical flaws in software and hardware provided by MITRE [29], often with specific vulnerability examples. In total, more than 400 different CWE types are defined and categorized into different classes and variants, e.g., memory corruption errors. Listing 1 shows an example of CWE-502 (Deserialization of Untrusted Data) in Python. In this example from [29], the pickle library is used to deserialize data: the code parses data and tries to authenticate a user by validating a token, but without verifying the incoming data. A potential attacker can construct a malicious pickle that spawns new processes; since pickle allows objects to define how they are unpickled, the attacker can direct the unpickling process to call the subprocess module and execute /bin/sh.
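Since Listing 1 itself is not reproduced here, the following minimal Python sketch, modeled loosely on MITRE's CWE-502 example, shows the kind of unverified deserialization described above; the function name is illustrative and this is not the paper's actual listing.

import pickle

def parse_auth_token(data: bytes):
    # Vulnerable (CWE-502): the incoming bytes are deserialized without any
    # verification. An attacker-crafted pickle can define __reduce__ so that
    # unpickling calls the subprocess module and executes /bin/sh.
    token = pickle.loads(data)
    return token

# A safer design would verify the data (e.g., with an HMAC over a trusted
# serialization format such as JSON) before deserializing it.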
For our work, we focus on the analysis of thirteen representative CWEs that can be detected via static analysis tools to show that we can systematically generate vulnerable code and the corresponding input prompts. We decided not to use fuzzing for vulnerability detection due to its potentially high computational cost and the manual effort imposed by root cause analysis. Some CWEs represent mere code smells or require considering the development and deployment process, and are hence out of scope for this work. The thirteen analyzed CWEs, including a brief description, are listed in Table I. Of the thirteen listed CWEs, eleven are from MITRE's top 25 list of the most dangerous software weaknesses. The descriptions are taken from MITRE [29].
Authors:
(1) Hossein Hajipour, CISPA Helmholtz Center for Information Security ([email protected]);
(2) Keno Hassler, CISPA Helmholtz Center for Information Security ([email protected]);
(3) Thorsten Holz, CISPA Helmholtz Center for Information Security ([email protected]);
(4) Lea Schönherr, CISPA Helmholtz Center for Information Security ([email protected]);
(5) Mario Fritz, CISPA Helmholtz Center for Information Security ([email protected]).