Echoes in the Code: The Lasting Impact and Future Path of AI Vulnerability Benchmarking

News Room | Published 28 July 2025

Table of Links

Abstract and I. Introduction

II. Related Work

III. Technical Background

IV. Systematic Security Vulnerability Discovery of Code Generation Models

V. Experiments

VI. Discussion

VII. Conclusion, Acknowledgments, and References

Appendix

A. Details of Code Language Models

B. Finding Security Vulnerabilities in GitHub Copilot

C. Other Baselines Using ChatGPT

D. Effect of Different Number of Few-shot Examples

E. Effectiveness in Generating Specific Vulnerabilities for C Codes

F. Security Vulnerability Results after Fuzzy Code Deduplication

G. Detailed Results of Transferability of the Generated Nonsecure Prompts

H. Details of Generating non-secure prompts Dataset

I. Detailed Results of Evaluating CodeLMs using Non-secure Dataset

J. Effect of Sampling Temperature

K. Effectiveness of the Model Inversion Scheme in Reconstructing the Vulnerable Codes

L. Qualitative Examples Generated by CodeGen and ChatGPT

M. Qualitative Examples Generated by GitHub Copilot

VI. DISCUSSION

In contrast to manual methods, our approach systematically finds non-secure prompts that lead models to generate vulnerable code, and it therefore scales to testing models for new types of vulnerabilities. This allows our security benchmark to be extended with non-secure prompts built from samples of specific CWEs, covering additional vulnerability types. By publishing the implementation of our approach and the generated dataset of non-secure prompts, we also enable the community to contribute more CWEs and extend our dataset of promising non-secure prompts.
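
As a concrete illustration, the core sampling loop can be sketched in a few lines of Python. Note that `query_model` and `is_vulnerable` below are hypothetical stand-ins for the target model's API and the static-analysis check, a minimal sketch rather than the authors' published implementation.

```python
# A minimal sketch of the few-shot sampling loop, assuming a hypothetical
# `query_model(prompt) -> str` wrapper around the code model and a
# hypothetical `is_vulnerable(code) -> bool` static-analysis check.
import random

def sample_nonsecure_prompts(cwe_examples, query_model, is_vulnerable,
                             n_shots=4, n_rounds=100):
    """Sample candidate non-secure prompts for one CWE and keep those
    whose completions are flagged as vulnerable."""
    found = []
    for _ in range(n_rounds):
        # Build a few-shot prompt from known vulnerable samples of this CWE.
        shots = random.sample(cwe_examples, k=min(n_shots, len(cwe_examples)))
        few_shot_prompt = "\n\n".join(shots)
        # Ask the model to continue the pattern, yielding a candidate prompt.
        candidate = query_model(few_shot_prompt)
        # Let the model complete the candidate and check the result.
        completion = query_model(candidate)
        if is_vulnerable(completion):
            found.append(candidate)
    return found
```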

A. Transferability

In our evaluation, we have shown that the discovered non-secure prompts are transferable across different language models: non-secure prompts sampled from one model also generate a significant number of vulnerable code instances containing the targeted CWE when used with another model. In most cases, non-secure prompts sampled via ChatGPT even uncover a higher fraction of vulnerabilities in code generated by CodeGen. We therefore publish a dataset of non-secure prompts that can be used to benchmark the security of black-box code generation models. Additionally, our dataset can be used to assess both current and future methods, e.g., He and Vechev [58], that aim to improve the reliability of code models in generating secure code.
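
To make the transferability measurement concrete, it can be expressed as the fraction of completions flagged for the targeted CWE when prompts sampled from a source model are replayed against a target model. The function below is a hedged sketch with hypothetical callables, not the paper's evaluation code.

```python
# Hedged sketch of the transferability measurement. `complete_with_target`
# queries the target model and `flags_target_cwe` wraps the static analyzer;
# both are hypothetical stand-ins for the paper's actual tooling.
def transfer_rate(prompts, complete_with_target, flags_target_cwe):
    """Fraction of source-model prompts whose target-model completions
    are flagged as containing the targeted CWE."""
    if not prompts:
        return 0.0
    flagged = sum(1 for p in prompts
                  if flags_target_cwe(complete_with_target(p)))
    return flagged / len(prompts)
```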

Our approach successfully finds non-secure prompts for different CWEs and programming languages, and it can be extended to new targets without changing our general few-shot approach. Our benchmark can therefore be augmented in the future with further kinds of vulnerabilities and code analysis techniques.

B. Limitations

While our approach provides a highly automated evaluation, it requires a set of vulnerable code samples to seed the approximated model inversion. Using known CVEs as prompts is impractical because of the human effort required to extract the relevant parts into a standalone sample. The samples used here are derived from various datasets (see Section IV-B) and represent the respective CWEs in the most condensed way. This manual selection could, however, introduce bias into the evaluation; we reduce its impact by using multiple samples per CWE from different sources.

Secondly, we rely on static analysis, namely CodeQL [46], to flag vulnerable code. A known limitation of such tools is that they can only approximate, not guarantee, accurate reports [59]. To limit the influence of false positives and false negatives on our ranking, we picked one of the best-performing freely available tools for the task [60]. In addition, the generated code that we test with CodeQL contains only a few functions, which minimizes the risk of incorrect reports while keeping the vulnerability detection objective, reproducible, and low-effort.
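
For reference, such a check can be scripted around the CodeQL CLI. The sketch below assumes Python targets and the `python-security-and-quality.qls` query suite; the paths and suite name are illustrative assumptions rather than the paper's exact configuration.

```python
# A sketch of flagging generated code with the CodeQL CLI. The query suite
# name, paths, and Python-only focus are assumptions for illustration.
import json
import subprocess

def codeql_alert_ids(source_dir, db_dir="codeql-db",
                     sarif_path="results.sarif"):
    """Build a CodeQL database over the generated snippets, run a security
    query suite, and return the rule IDs of any alerts."""
    subprocess.run(
        ["codeql", "database", "create", db_dir,
         "--language=python", f"--source-root={source_dir}", "--overwrite"],
        check=True)
    subprocess.run(
        ["codeql", "database", "analyze", db_dir,
         "python-security-and-quality.qls",  # assumed installed query suite
         "--format=sarif-latest", f"--output={sarif_path}"],
        check=True)
    with open(sarif_path) as f:
        sarif = json.load(f)
    return [result["ruleId"]
            for run in sarif.get("runs", [])
            for result in run.get("results", [])]
```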

VII. CONCLUSIONS

There have been tremendous advances in large-scale language models for code generation, and state-of-the-art models are now used by millions of programmers every day. Unfortunately, we do not yet fully understand the shortcomings and limitations of such models, especially with respect to the insecure code they generate. Most importantly, we have lacked a method for systematically identifying prompts that lead to code with security vulnerabilities. In this paper, we presented an automated approach to address this challenge. We approximated the black-box inversion of the target models via few-shot prompting, which allows us to automatically find prompts that elicit targeted vulnerabilities from black-box code generation models. We proposed three different few-shot prompting strategies and used static analysis to check the generated code for potential security vulnerabilities.

We evaluated our method using the CodeGen and ChatGPT models and showed that it uncovers more than 2k instances of vulnerable code generated by these models. Furthermore, we introduced a dataset of non-secure prompts designed to benchmark code language models on generating vulnerable code. Using this public benchmark, we can measure progress in terms of vulnerable code generated by large language models. Additionally, with our proposed method, we can flexibly expand this dataset to include newly discovered vulnerabilities and update it with additional sets of non-secure prompts.

ACKNOWLEDGEMENTS

This work was partially funded by ELSA – European Lighthouse on Secure and Safe AI funded by the European Union under grant agreement No. 101070617. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or European Commission. Neither the European Union nor the European Commission can be held responsible for them.

REFERENCES

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in NeurIPS, 2020.

[2] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” arXiv, 2022.

[3] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” arXiv, 2022.

[4] OpenAI, “ChatGPT: Optimizing language models for dialogue,” Nov. 2022, https://openai.com/blog/chatgpt/, as of October 24, 2023.

[5] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. W. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, I. Babuschkin, S. A. Balaji, S. Jain, A. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, “Evaluating large language models trained on code,” arXiv, 2021.

[6] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An open large language model for code with multi-turn program synthesis,” arXiv, 2022.

[7] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis, “Incoder: A generative model for code infilling and synthesis,” arXiv, 2022.

[8] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P.-S. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals, “Competition-level code generation with alphacode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022. [Online]. Available: https://www.science.org/doi/abs/10.1126/science.abq1158

[9] T. Dohmke, “GitHub Copilot is generally available to all developers,” Jun. 2022, https://github.blog/2022-06-21-github-copilot-is-generally-available-to-all-developers/, as of October 24, 2023.

[10] S. Imai, “Is github copilot a substitute for human pair-programming? an empirical study,” in Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, 2022, pp. 319–321.

[11] S. Zhao, “GitHub Copilot is generally available for businesses,” Dec. 2022, https://github.blog/2022-12-07-github-copilot-is-generally-available-for-businesses/, as of October 24, 2023.

[12] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models for code,” arXiv, 2023.

[13] S. Mouselinos, M. Malinowski, and H. Michalewski, “A simple, yet effective approach to finding biases in code generation,” arXiv preprint arXiv:2211.00609, 2022.

[14] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, “A systematic evaluation of large language models of code,” in Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, ser. MAPS 2022. New York, NY, USA: Association for Computing Machinery, 2022, p. 1–10. [Online]. Available: https://doi.org/10.1145/3520312.3534862

[15] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? Assessing the security of GitHub Copilot’s code contributions,” in IEEE S&P, 2022.

[16] H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, “Examining zero-shot vulnerability repair with large language models,” in 2023 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 2023, pp. 1–18.

[17] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu et al., “Exploring the limits of transfer learning with a unified text-to-text transformer.” JMLR, 2020.

[18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019.

[19] L. Gao, J. Schulman, and J. Hilton, “Scaling laws for reward model overoptimization,” arXiv, 2022.

[20] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in EMNLP, 2021.

[21] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “CodeBERT: A pre-trained model for programming and natural languages,” in EMNLP, 2020.

[22] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu, M. Tufano, S. K. Deng, C. B. Clement, D. Drain, N. Sundaresan, J. Yin, D. Jiang, and M. Zhou, “Graphcodebert: Pretraining code representations with data flow,” in ICLR, 2021.

[23] W. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified pretraining for program understanding and generation,” in NAACL, 2021.

[24] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim et al., “Starcoder: may the source be with you!” arXiv, 2023.

[25] H. Hajipour, N. Yu, C.-A. Staicu, and M. Fritz, “Simscood: Systematic analysis of out-of-distribution behavior of source code models,” arXiv, 2022.

[26] L. Szekeres, M. Payer, T. Wei, and D. Song, “SoK: Eternal War in Memory,” in IEEE Symposium on Security and Privacy, 2013.

[27] G. Sandoval, H. Pearce, T. Nys, R. Karri, B. Dolan-Gavitt, and S. Garg, “Security implications of large language model code assistants: A user study,” arXiv preprint arXiv:2208.09727, 2022.

[28] M. L. Siddiq and J. C. Santos, “SecurityEval dataset: Mining vulnerability examples to evaluate machine learning-based code generation techniques,” in MSR4P&S, 2022.

[29] MITRE, “CWE – Common Weakness Enumeration,” 2022, https://cwe.mitre.org, as of October 24, 2023.

[30] M. L. Siddiq, S. H. Majumder, M. R. Mim, S. Jajodia, and J. C. Santos, “An empirical study of code smells in transformer-based code generation techniques,” in SCAM, 2022.

[31] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in CVPR, 2015.

[32] H. Yin, P. Molchanov, J. Alvarez, Z. Li, A. Mallya, D. Hoiem, N. Jha, and J. Kautz, “Dreaming to distill: Data-free knowledge transfer via deepinversion,” in CVPR, 2020.

[33] M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in ACM CCS, 2015.

[34] K.-C. Wang, Y. Fu, K. Li, A. Khisti, R. Zemel, and A. Makhzani, “Variational model inversion attacks,” in NeurIPS, 2021.

[35] Y. Nakamura, S. Hanaoka, Y. Nomura, N. Hayashi, O. Abe, S. Yada, S. Wakamiya, and E. Aramaki, “Kart: Privacy leakage framework of language models pre-trained with clinical records,” arXiv, 2020.

[36] R. Zhang, S. Hidano, and F. Koushanfar, “Text revealer: Private text reconstruction via model inversion attacks against transformers,” arXiv, 2022.

[37] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel, “Extracting Training Data from Large Language Models,” in USENIX Security Symposium, 2021.

[38] M. Beller, R. Bholanath, S. McIntosh, and A. Zaidman, “Analyzing the state of static analysis: A large-scale evaluation in open source software,” in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1, 2016, pp. 470– 481.

[39] G. Chatzieleftheriou and P. Katsaros, “Test-driving static analysis tools in search of c code vulnerabilities,” in 2011 IEEE 35th Annual Computer Software and Applications Conference Workshops, 2011, pp. 96–103.

[40] M. Christakis and C. Bird, “What developers want and need from program analysis: An empirical study,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’16. New York, NY, USA: Association for Computing Machinery, 2016, p. 332–343. [Online]. Available: https://doi.org/10.1145/2970276.2970347

[41] A. Gosain and G. Sharma, “Static analysis: A survey of techniques and tools,” in Intelligent Computing and Applications. Springer, 2015, pp. 581–591.

[42] K. Goseva-Popstojanova and A. Perhinschi, “On the capability of static code analysis to detect security vulnerabilities,” Information and Software Technology, vol. 68, pp. 18–33, 2015.

[43] N. Ayewah, W. Pugh, D. Hovemeyer, J. D. Morgenthaler, and J. Penix, “Using static analysis to find bugs,” IEEE Software, vol. 25, no. 5, pp. 22–29, 2008.

[44] Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher, J. Grosen, S. Feng, C. Hauser, C. Kruegel, and G. Vigna, “SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis,” in IEEE Symposium on Security and Privacy (SP), 2016.

[45] A. Fioraldi, D. Maier, H. Eißfeldt, and M. Heuse, “AFL++: Combining incremental steps of fuzzing research,” in USENIX Workshop on Offensive Technologies (WOOT), 2020.

[46] GitHub Inc., “GitHub CodeQL,” 2022, https://codeql.github.com/, as of October 24, 2023.

[47] L. Wang, A. Schwing, and S. Lazebnik, “Diverse and accurate image description using a variational auto-encoder with an additive gaussian encoding space,” in NeurIPS, 2017.

[48] A. Deshpande, J. Aneja, L. Wang, A. G. Schwing, and D. Forsyth, “Fast, diverse and accurate image captioning guided by part-of-speech,” in CVPR, 2019.

[49] NSA Center for Assured Software, “Juliet C/C++ 1.3,” Oct. 2017, https://samate.nist.gov/SARD/test-suites/112, as of October 24, 2023.

[50] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” in ICLR, 2020.

[51] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp, “Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity,” in ACL, May 2022.

[52] OpenAI, “OpenAI API Documentation,” 2022, https://beta.openai.com/docs/introduction, as of October 24, 2023.

[53] P. Bareiß, B. Souza, M. d’Amorim, and M. Pradel, “Code generation tools (almost) for free? a study of few-shot, pre-trained language models on code,” arXiv, 2022.

[54] OpenAI, “Gpt-4 technical report,” 2023.

[55] HuggingFace, “Big code models leaderboard,” Oct. 2023, https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard, as of October 24, 2023.

[56] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang, “Wizardcoder: Empowering code large language models with evol-instruct,” arXiv, 2023.

[57] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” arXiv, 2023.

[58] J. He and M. Vechev, “Large language models for code: Security hardening and adversarial testing,” Workshop on Challenges in Deployable Generative AI at International Conference on Machine Learning (ICML), 2023.

[59] B. Chess and G. McGraw, “Static analysis for security,” IEEE security & privacy, vol. 2, no. 6, pp. 76–79, 2004.

[60] S. Lipp, S. Banescu, and A. Pretschner, “An empirical study on the effectiveness of static c code analyzers for vulnerability detection,” in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 544–555.

[61] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv, 2023.

[62] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D. Jiang, “Wizardlm: Empowering large language models to follow complex instructions,” arXiv, 2023.

[63] P. Thakkar, “Copilot internals,” 2022, https://thakkarparth007.github.io/copilot-explorer/posts/copilot-internals, as of October 24, 2023.

[64] SeatGeek, “Thefuzz,” 2022, https://github.com/seatgeek/thefuzz, as of October 24, 2023.

[65] L. Yujian and L. Bo, “A normalized levenshtein distance metric,” TPAMI, 2007.

Authors:

(1) Hossein Hajipour, CISPA Helmholtz Center for Information Security ([email protected]);

(2) Keno Hassler, CISPA Helmholtz Center for Information Security ([email protected]);

(3) Thorsten Holz, CISPA Helmholtz Center for Information Security ([email protected]);

(4) Lea Schönherr, CISPA Helmholtz Center for Information Security ([email protected]);

(5) Mario Fritz, CISPA Helmholtz Center for Information Security ([email protected]).

