Table of Links
Abstract and I. Introduction
II. Methods
III. Results
IV. Conclusion, Future Work, and References
II. METHODS
A. Datasets
To train our framework, which combines a RandomForestClassifier and LLMs for classification and repair (Fig. 1), the dataset must incorporate several essential features.
A source code column (“contract source”) is necessary to run Slither and the LLMs. However, since the datasets consistently excluded source code, a web-scraping algorithm keyed on the “contract address” column was needed to obtain source code from Etherscan, supplemented by generated contracts (see subsection C). To account for source code that could not be scraped through Etherscan, the dataset (200,000 contracts) was reduced to 2,500 rows.
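As a rough sketch, the scraping step can be implemented against Etherscan's public contract API; the `fetch_source` helper, file names, and column names below are illustrative, and a valid API key is assumed.

```python
import requests
import pandas as pd

ETHERSCAN_URL = "https://api.etherscan.io/api"
API_KEY = "YOUR_ETHERSCAN_API_KEY"  # assumption: a valid Etherscan key

def fetch_source(address: str):
    """Fetch verified Solidity source for one contract address."""
    params = {
        "module": "contract",
        "action": "getsourcecode",
        "address": address,
        "apikey": API_KEY,
    }
    resp = requests.get(ETHERSCAN_URL, params=params, timeout=10).json()
    source = resp["result"][0]["SourceCode"]
    return source or None  # empty string means no verified source

df = pd.read_csv("contracts.csv")            # contains a "contract address" column
df["contract source"] = df["contract address"].map(fetch_source)
df = df.dropna(subset=["contract source"])   # drop rows that could not be scraped
```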
Slither was then run on the newly acquired source code (see subsection B), adding the columns “vulnerability”, “confidence”, and “impact”. Slither occasionally failed to produce output, totaling 474 failed contracts (an 80% success rate). To account for this, the dataset was reduced again to 2,000 smart contracts, of which 400 were labeled malicious and 1,600 non-malicious. Table I shows a segment of the finalized dataset.
B. Slither
Slither is a static code analyzer that checks smart contracts for vulnerabilities without executing them. Slither’s initial input is the Solidity Abstract Syntax Tree (AST) generated by the Solidity compiler from the contract source code. The contract is then simplified into an intermediate representation called SlithIR, which is checked against current industry standards to produce a list of vulnerabilities. Slither leads the industry in smart contract vulnerability detection, outperforming other static analyzers in almost every metric, as shown in Table II. This, coupled with our Random Forest Classifier, ensures high accuracy in detecting vulnerable smart contracts.
After importing and running all 89 basic detectors provided by the API, we added each contract’s vulnerabilities to the dataset as a list of Slither’s natural-language detector names, with empty lists denoting contracts Slither deemed safe.
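As a minimal sketch of this step, Slither can be driven from Python through its CLI, with detector names read from the `results.detectors` field of its JSON report; the helper name and file paths here are illustrative.

```python
import json
import subprocess

def slither_checks(sol_path: str):
    """Run Slither's detectors on one contract; return check names, or None on failure."""
    proc = subprocess.run(
        ["slither", sol_path, "--json", "-"],
        capture_output=True, text=True,
    )
    try:
        report = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None  # Slither produced no parseable output
    if not report.get("success"):
        return None  # analysis failed for this contract
    detectors = report["results"].get("detectors", [])
    return [d["check"] for d in detectors]  # e.g. ["reentrancy-eth", ...]

vulns = slither_checks("contract.sol")  # [] denotes a contract Slither deems safe
```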
C. Data Issues and Generation
Specific issues were encountered during data collection, the largest of which was extracting source code. For instance, one dataset provided only bytecode, which we were unable to decompile into analyzable source code because we were unaware of the decompiler’s limits. We also struggled to find additional malicious source code to train a model on, as our dataset included only 150 malicious contracts. To overcome this, we used OpenAI’s GPT-3.5-Turbo to generate malicious source code. Initial attempts were barred by GPT-3.5’s ethical limitations (Fig. 2). However, after jailbreaking GPT-3.5 with prompt engineering [18], it would produce malicious source code that could then be repaired by the model.
The variability of the dataset made it difficult to generate Slither vulnerabilities for the smart contracts, so a multi-step approach was used. The primary issue was the 100+ Solidity versions in which the contracts were written, combined with Solidity’s limited backward compatibility: a contract written for version 0.4.11 could run on a 0.4.26 compiler, for example, but not on a 0.5.0+ compiler. Addressing this required modifying each contract to read “pragma solidity >={version}”, creating five different scripts, and running each script on the entire dataset with one of the following Solidity versions: 0.4.26, 0.5.17, 0.6.12, 0.7.6, or 0.8.21. Contracts that could not be compiled had their Slither vulnerabilities recorded as null; those that compiled were recorded with the English name of each vulnerability, obtained by parsing the returned JSON. Combining these lists produced the final list of Slither vulnerabilities for the 75% of smart contracts for which this method yielded results.
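One way to realize this pass, sketched below under the assumption that solc-select is used to switch compiler versions (the paper does not name its tooling), is to relax each contract's pragma and retry the analysis under each of the five versions; `slither_checks` is the helper from the previous sketch.

```python
import re
import subprocess

VERSIONS = ["0.4.26", "0.5.17", "0.6.12", "0.7.6", "0.8.21"]

def relax_pragma(source: str, version: str) -> str:
    """Rewrite the pragma so the contract accepts any compiler >= version."""
    return re.sub(r"pragma solidity[^;]*;",
                  f"pragma solidity >={version};", source)

def vulns_under_version(source: str, version: str):
    """Analyze one contract under one solc version; None if compilation fails."""
    subprocess.run(["solc-select", "use", version], check=True)
    path = "tmp_contract.sol"
    with open(path, "w") as f:
        f.write(relax_pragma(source, version))
    return slither_checks(path)

def vulns_any_version(source: str):
    """Combine the per-version passes: keep the first version that compiles."""
    for v in VERSIONS:
        result = vulns_under_version(source, v)
        if result is not None:
            return result
    return None  # recorded as null: no version compiled
```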
Each detector class includes the detector’s confidence and impact levels. After creating a key-value mapping from each detector’s English name to its confidence and impact, this mapping was used to build confidence and impact lists covering all vulnerabilities in each smart contract.
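A sketch of that lookup, assuming the confidence and impact are taken from the same parsed JSON findings (Slither reports both fields alongside each check name):

```python
# Map each detector's English name to its (confidence, impact) pair,
# accumulated while parsing the Slither reports above.
detector_meta = {}
for d in report["results"]["detectors"]:
    detector_meta[d["check"]] = (d["confidence"], d["impact"])

def meta_lists(vuln_names):
    """Expand a contract's vulnerability list into parallel metadata lists."""
    confidences = [detector_meta[name][0] for name in vuln_names]
    impacts = [detector_meta[name][1] for name in vuln_names]
    return confidences, impacts
```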
D. Classifier
Various models were implemented to classify smart contract maliciousness. Ultimately, RandomForestClassifier (RFC) provided the highest accuracy after pre-processing the finalized dataset.
RFC cannot train on the dataset as produced by web scraping, generation, and Slither processing because of the abundance of unnecessary string-based features. Unnecessary features are therefore dropped, and the necessary ones are processed for RFC. For example, “confidence” and “vulnerability” correlate more weakly with “malicious” than “impact” does, so both are dropped to avoid convoluting the model. Thus “contract source” and “impact” remain as the classifying features, with “malicious” as the target label.
As all columns are still either string or boolean data types, RFC remains unable to train on the dataset. “contract source” was tokenized using the CountVectorizer (CV) tool from the scikit-learn library. “malicious” and “impact” were encoded into usable numeric values via mapping dictionaries. Since “impact”, unlike “malicious”, contained more than two possible values, its outputs were mapped to an integer scale from 0 to 4.
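A sketch of this preprocessing with scikit-learn follows; the mapping keys are illustrative (the paper does not list the exact dictionaries), assuming Slither's five ordered impact levels and one impact value per row, e.g. the maximum across findings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tokenize contract source into a bag-of-words feature matrix.
cv = CountVectorizer()
X_source = cv.fit_transform(df["contract source"])

# Illustrative mapping dictionaries; "impact" has five ordered levels,
# hence the 0-4 scale, while "malicious" is binary.
impact_map = {"Optimization": 0, "Informational": 1, "Low": 2,
              "Medium": 3, "High": 4}
df["impact_num"] = df["impact"].map(impact_map)
df["malicious_num"] = df["malicious"].map({False: 0, True: 1})
```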
After the tokenized and encoded columns are concatenated, RFC’s numeric prerequisite is fulfilled.
The data is then shuffled and split 0.6/0.4 into training and test sets before RFC fits the training set and predicts on the test set. Accuracy and the confusion matrix are evaluated in Results.
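A minimal sketch of this pipeline stage (hyperparameters such as `random_state` are illustrative):

```python
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Concatenate the tokenized source with the encoded impact feature.
X = hstack([X_source, df[["impact_num"]].values])
y = df["malicious_num"]

# 0.6/0.4 split with shuffling, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, shuffle=True, random_state=42)

rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)
pred = rfc.predict(X_test)

print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```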
E. Large Language Models (LLMs)
1) Finetuning Llama-2-7B: We incorporated multiple Large Language Models to repair smart contracts after they had been identified as malicious by our two-layered framework. The best results came from the Llama-2-7B model, available on Hugging Face. The model finished training in July 2023, and our finetuning took place about three weeks later. Llama-2-7B has become very popular due to its low parameter count and reliability, making it a less memory-intensive alternative to other LLMs in the industry.
The finetuning took place on Google Colab using a T4 GPU, which carries 16 GB of VRAM. However, Llama-2-7B’s 16-bit weights alone nearly fill this limit (7B parameters × 2 bytes = 14 GB), before accounting for optimizer states, gradients, or activations. Thus, to run Llama-2-7B without memory restrictions on a platform like Google Colab, we used parameter-efficient finetuning (PEFT). Specifically, we used QLoRA (Efficient Finetuning of Quantized LLMs), quantizing weights to 4-bit precision instead of the usual 16-bit precision. This quantization allows finetuning on Colab while keeping the model’s precision adequate, because the 4-bit model is saved together with its QLoRA adapters, which can be used with the model.
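The 4-bit setup can be sketched with the Hugging Face transformers, peft, and bitsandbytes stack; the LoRA hyperparameters below are common defaults rather than the paper's exact values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load Llama-2-7B with 4-bit NF4 quantization so the weights fit in 16 GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach small trainable LoRA adapters; the 4-bit base weights stay frozen.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```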
Moreover, Llama-2-7B is open source, meaning the model can be downloaded and run locally. Traditional data-privacy concerns with LLMs are therefore nullified, because all data is processed on the local machine rather than on a third-party server. This bodes well for smart contracts, many of which execute agreements involving sensitive information and large sums of money. Llama-2-7B provides the benefits and accuracy of an advanced LLM while also providing the security and versatility necessary for blockchain technology.
The Llama-2-7B model was fine-tuned, using a supervised learning approach, on fifty smart contracts that were originally malicious and had since been repaired. These contracts were gathered during the data collection described above. Specifically, the source code was tokenized and embedded using the quantization outlined previously. The model was trained over 100 steps, with training loss decreasing consistently at every step (Fig. 3).
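A sketch of that supervised pass using trl's SFTTrainer follows (the argument names match older trl releases; the pairing format and most hyperparameters are assumptions, with only the 100-step budget taken from the text).

```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Each example pairs malicious source with its repaired version in one string.
pairs = [{"text": f"### Vulnerable:\n{bad}\n### Repaired:\n{good}"}
         for bad, good in repaired_pairs]   # the fifty collected pairs
train_ds = Dataset.from_list(pairs)

trainer = SFTTrainer(
    model=model,                            # the 4-bit PEFT model from above
    train_dataset=train_ds,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="llama2-contract-repair",
        max_steps=100,                      # trained over 100 steps
        per_device_train_batch_size=1,
        learning_rate=2e-4,
        logging_steps=10,
    ),
)
trainer.train()
```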
This supervised fine-tuning allowed the model to learn the relationship between malicious source code and its repaired counterpart, and to emulate that repair on any other contract.
2) Prompt Engineering: We also used OpenAI’s API to repair vulnerabilities with GPT-3.5-Turbo. OpenAI is one of the best-known names in the industry, with applications such as DALL-E and ChatGPT. While all GPT models are optimized to generate code, GPT-3.5-Turbo offers the best combination of performance and efficiency. Moreover, its chat interface let us use prompt engineering to craft a prompt with the best possible performance. Directly querying GPT-3.5-Turbo to repair malicious code was unsuccessful: as with the generation of malicious smart contracts, GPT-3.5-Turbo was reluctant to work with malicious source code (Fig. 4). Prompt engineering was therefore used to circumvent this problem.
First, the word “malicious” had to be removed. While we wanted our LLM to repair malicious smart contracts, GPT-3.5-Turbo was instead asked to help us “fix vulnerable smart contracts”.
We then used chain-of-thought techniques so that the model would elaborate on what changes it made and why. This led to more accurate source-code output and more vulnerabilities repaired. It also gave the user more information, as the specific vulnerabilities in the malicious smart contract were highlighted and explained.
Ultimately, our prompt (Fig. 5) supplied the contract source code and Slither’s vulnerabilities to GPT-3.5-Turbo to repair the smart contracts. While Slither also outputs an impact level and a confidence for each vulnerability, we found that incorporating these into the prompt hurt the model’s ability to output repaired source code, or even source code that could be compiled; essentially, using the extra Slither outputs led to overfitting. The same prompt was also used with the Llama-2-7B model outlined above to create uniformity across outputs. In both models, the prompt produced repaired source code along with details that explained and justified each change.
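A sketch of the repair query through the OpenAI Python client follows; the exact prompt wording is given in Fig. 5, so the phrasing here only paraphrases the two ingredients named in the text (the contract source and Slither's vulnerability names) plus the chain-of-thought instruction.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def repair_contract(source: str, vulns: list) -> str:
    """Ask GPT-3.5-Turbo to fix a vulnerable contract and explain each change."""
    prompt = (
        "Help us fix this vulnerable smart contract.\n"  # "vulnerable", not "malicious"
        f"Slither reports these vulnerabilities: {', '.join(vulns)}.\n"
        "Think step by step: for each vulnerability, explain the change you "
        "make and why, then output the full repaired source code.\n\n"
        f"{source}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```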
In conclusion, we ended with two primary models for repairing source code: Llama-2-7B, finetuned specifically for repairing smart contracts, and GPT-3.5-Turbo, which repaired smart contracts through CoT prompt engineering.
Authors:
(1) Abhinav Jain, Westborough High School, Westborough, MA and contributed equally to this work ([email protected]);
(2) Ehan Masud, Sunset High School, Portland, OR and contributed equally to this work ([email protected]);
(3) Michelle Han, Granite Bay High School, Granite Bay, CA ([email protected]);
(4) Rohan Dhillon, Lakeside School, Seattle, WA ([email protected]);
(5) Sumukh Rao, Bellarmine College Preparatory, San Jose, CA ([email protected]);
(6) Arya Joshi, Robbinsville High School, Robbinsville, NJ ([email protected]);
(7) Salar Cheema, University of Illinois, Champaign, IL ([email protected]);
(8) Saurav Kumar, University of Illinois, Champaign, IL ([email protected]).
This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL license.