Abstract
Email is a vital tool for communication in today’s world; however, spam emails have emerged as a major challenge. These unsolicited messages from unknown sources often fill inboxes, disrupting communication and productivity. This paper investigates the use of various machine learning techniques to classify emails as either “spam” or “ham” (non-spam).
Classification models including K-Nearest Neighbors (KNN), Logistic Regression, Support Vector Machines (SVM), and Naïve Bayes are compared on this task. The performance of each model is measured with metrics such as accuracy, precision, recall, and F1 score to determine which approach is most suitable.
Introduction
In 2023, roughly 347.3 billion emails were sent every day, of which spam composed about 45% of all email traffic. Email spam costs businesses an estimated $20.5 billion every year. Given this, there will always be a need to ensure spam is correctly classified and does not interfere with legitimate email traffic, and the problem is a useful way to understand the potential and application of machine learning.
In the case of spam detection, a human user familiar with email can recognize spam almost immediately upon looking at it, which makes identifying spam a useful application for a machine learning classifier. To classify an email as spam or ham, a number of preprocessing steps are involved: preparing the data so that it is acceptable to a linear classifier, then tokenizing and stemming each line and converting the result to a TF-IDF (Term Frequency – Inverse Document Frequency) table.
KNN with K=3 was selected as the baseline model. Logistic Regression with L1 regularization, Naïve Bayes, and SVM models were also tested before choosing the best model. For this project, a dataset from Kaggle was used, containing 2551 “ham” email files and 501 spam email files, and the modeling was done in the R programming language.
The chart below illustrates the flow of the steps that were followed –
Preprocessing Steps
Given that a large volume of text data is being used, the data needed to be preprocessed to clean it up and convert it to a format that the classification models can use. The following steps were followed as part of the data preprocessing –
- The text data needs to be acceptable to a linear classifier. This means the dataset must be transformed into numeric features using text feature extraction methods.
- First, each line of the text is tokenized and stemmed. The stemming process shortens words by removing inflected endings; for example, workers becomes worker.
- Next, the tokenized data is converted to a TF-IDF (Term Frequency – Inverse Document Frequency) table. TF-IDF is an approach to text analysis that represents each n-gram in a document in terms of its term frequency-inverse document frequency. Term frequency is simply the frequency of a given term within a document (in this case, an email). Inverse document frequency is generally stated as:
log (Total Number of Documents / Number of Documents With Term)
For example, a term that appears in 10 out of 1000 documents has an IDF of log(1000/10) ≈ 4.6 (using the natural log). This weights distinctive terms more highly than very frequent, and therefore less informative, terms.
- Next, the terms are reduced by keeping only those that appear in at least 2 percent of documents but no more than 95 percent of documents. This helps prevent overfitting by removing terms that are either too rare or too prevalent in the training set.
It is important to note that the strategy for utilizing TF-IDF with the test set is to perform model training only on the training-set TF-IDF, and then to re-calculate TF-IDF over the complete data when measuring test accuracy. This is required because a term's TF-IDF depends on its frequency across the entire dataset, and it also ensures the test data is not incorporated during training.
Finally, only the top 1000 terms by overall term frequency are kept in the training set (a sketch of this filtering step appears after the code listing below).
Code for the Preprocessing steps –
#load the libraries used in the preprocessing
library(dplyr)
library(tidytext)
library(SnowballC)
#tokenize each email body into individual words
word_tokens <- complete_tbl %>% unnest_tokens(word, content)
#stemming (e.g. workers -> worker)
word_tokens <- word_tokens %>% mutate(word_stem = SnowballC::wordStem(word))
#remove tokens that are purely numeric
word_tokens <- word_tokens %>% filter(!grepl('^\\d+$', word_stem))
#remove any tokens containing a .
word_tokens <- word_tokens %>% filter(!grepl('[.]', word_stem))
#remove any single character tokens
word_tokens <- word_tokens %>% filter(!grepl('^[a-z]$', word_stem))
#remove tokens which match stop words (stopWords is a character vector defined earlier)
word_tokens <- word_tokens %>% filter(!word %in% stopWords)
word_tokens <- word_tokens %>% filter(!word_stem %in% stopWords)
#split into training and test (ind holds the training document ids)
word_tokens_train <- word_tokens %>% filter(document %in% ind)
#create tf-idf for training and then a complete tf-idf for testing
tfidf_train <- word_tokens_train %>%
  count(document, word_stem, sort = TRUE) %>%
  bind_tf_idf(word_stem, document, n)
tfidf_complete <- word_tokens %>%
  count(document, word_stem, sort = TRUE) %>%
  bind_tf_idf(word_stem, document, n)
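The document-frequency filtering, top-1000 term selection, and construction of the wide feature matrices (wide_feat_train and wide_feat_test) used by the models are not shown in the original listing. The following is a minimal sketch of how these steps could be implemented from the objects created above; the thresholds follow the text, while the exact reshaping details are assumptions –
#Sketch (not from the original listing): document-frequency filtering,
#top-1000 term selection, and wide per-document feature matrices
library(tidyr)
n_docs_train <- n_distinct(word_tokens_train$document)
#keep terms appearing in at least 2% and at most 95% of training documents
doc_freq <- word_tokens_train %>%
  distinct(document, word_stem) %>%
  count(word_stem, name = "doc_count") %>%
  filter(doc_count >= 0.02 * n_docs_train, doc_count <= 0.95 * n_docs_train)
#keep the top 1000 remaining terms by overall term frequency
top_terms <- word_tokens_train %>%
  filter(word_stem %in% doc_freq$word_stem) %>%
  count(word_stem, sort = TRUE) %>%
  slice_head(n = 1000)
#pivot the tf-idf tables into one-row-per-document feature matrices
wide_feat_train <- tfidf_train %>%
  filter(word_stem %in% top_terms$word_stem) %>%
  select(document, word_stem, tf_idf) %>%
  pivot_wider(names_from = word_stem, values_from = tf_idf, values_fill = 0)
wide_feat_test <- tfidf_complete %>%
  filter(!document %in% ind, word_stem %in% top_terms$word_stem) %>%
  select(document, word_stem, tf_idf) %>%
  pivot_wider(names_from = word_stem, values_from = tf_idf, values_fill = 0)
#in practice the train and test matrices may need their columns aligned to the same term set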
Models Used
- K-Nearest Neighbors (KNN) – the Baseline model
Note that in each of the results below, the positive class is “ham”. The KNN model achieves reasonable specificity but extremely poor sensitivity, meaning many ham emails are falsely predicted as spam. This is the worst possible outcome for a user, since many legitimate emails would be misclassified as spam and therefore missed.
Code for KNN model –
##load the libraries used across the models
library(e1071)
library(caret)
library(class)
library(LiblineaR)
library(pROC)  #for roc()
library(ROCR)  #for prediction() and performance()
##remove document number since this is indicative of spam or ham
wide_feat_train<-subset(wide_feat_train, select=-c(document))
wide_feat_test<-subset(wide_feat_test,select=-c(document))
#Base model is a knn attempt
knn_pred<-knn(train=wide_feat_train,test=wide_feat_test,cl=labels_train$label,k=3)
knn_results<-confusionMatrix(knn_pred,labels_test$label)
knn_results
knn_results$byClass["F1"]
knn_results$byClass["Precision"]
knn_results$byClass["Recall"]
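The AUC value reported for KNN in the Results section is not computed in the listing above; a minimal sketch of how it could be obtained, mirroring the ROC approach used for the other models –
#ROC/AUC for the KNN predictions (sketch, following the same pattern as the other models)
prednum_knn <- ifelse(knn_pred == "spam", 0, 1)
roc_knn <- roc(labels_test$label, prednum_knn)
plot(roc_knn, main = "ROC curve for KNN model")
roc_knn$auc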
- Logistic Regression
After the poor results from the KNN model, Logistic Regression was the next model used. For this scenario, Logistic Regression was applied with the following hyperparameters –
- loss = “L1”
- cost = 2
- epsilon = 0.1
This model provides the following results on the test dataset, already a significant improvement over the KNN model. The overall accuracy is quite high, but the specificity shows there is still some room for improvement: a user of this model would find a few Spams being predicted as Hams.
#Next is a logistic regression using the below hyperparameters
grid_logit <- expand.grid(loss="L1",cost=2,epsilon=0.1)
lr <- train(x=wide_feat_train,y=labels_train$label,method="regLogistic",tuneGrid=grid_logit)
lr_results<-confusionMatrix(as.factor(predict(lr,wide_feat_test)),labels_test$label)
lr_results
p_lr = predict(lr,wide_feat_test)
prednum_lr<-ifelse(p_lr=="spam",0,1)
roc_lr<-roc(labels_test$label,prednum_lr)
plot(roc_lr)
roc_lr$auc
p1_lr<- prediction(as.numeric(p_lr),as.numeric(labels_test$label))
pr_lr <- performance(p1_lr, "prec", "rec")
plot(pr_lr)
lr_results$byClass["F1"]
lr_results$byClass["Precision"]
lr_results$byClass["Recall"]
- Naïve Bayes Model
The next model tried was Naïve Bayes. For this model, five-fold cross-validation was used to find the optimal hyperparameters. This results in the following parameters for Naïve Bayes –
- laplace = 0
- usekernel = FALSE
- adjust = 1
This model also achieves good results on both specificity and sensitivity.
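The train_control and grid objects referenced by the train() call below are not defined in the original listing. A minimal sketch of how they could be set up is shown here; the five-fold cross-validation follows the text (and matches the definition in the SVM section), while the particular candidate values in the grid are illustrative assumptions –
#cross-validation setup for the naive bayes tuning (sketch; candidate values are assumptions)
train_control <- trainControl(
  method = "cv",
  number = 5
)
grid <- expand.grid(
  laplace = c(0, 0.5, 1),
  usekernel = c(TRUE, FALSE),
  adjust = c(0.5, 1)
)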
##naive bayes main model
nb_cv <- train(
  x = wide_feat_train,
  y = labels_train$label,
  method = "naive_bayes",
  trControl = train_control,
  tuneGrid = grid
)
nb <- naiveBayes(wide_feat_train,labels_train$label,adjust=1,laplace=0,usekernel=FALSE)
nb_results<-confusionMatrix(as.factor(predict(nb,wide_feat_test)),labels_test$label)
nb_results
p = predict(nb,wide_feat_test)
prednum<-ifelse(p=="spam",0,1)
roc_nb<-roc(labels_test$label,prednum)
plot(roc_nb)
roc_nb$auc
p1<- prediction(as.numeric(p),as.numeric(labels_test$label))
pr <- performance(p1, "prec", "rec")
plot(pr)
nb_results$byClass["F1"]
nb_results$byClass["Precision"]
nb_results$byClass["Recall"]
- Support Vector Machine (SVM)
The final model is an SVM with a linear kernel, tuned with cross-validation. Support vector machines attempt to find the maximally separating hyperplane between the two classes.
Here, a 5-fold CV is performed using the R library caret to identify the optimal hyperparameters. These hyperparameters are shown below –
- cost = 1
- loss = L2
- weight = 3
The results of this model when applied to the held-out test dataset are shown below –
Code for SVM –
#svm with 5-fold cross-validation
train_control <- trainControl(
  method = "cv",
  number = 5
)
svm <- train(x=wide_feat_train,y=labels_train$label,method="svmLinearWeights2",trControl=train_control)
svm$bestTune
svm_results<-confusionMatrix(as.factor(predict(svm,wide_feat_test)),labels_test$label)
svm_results
p_svm = predict(svm,wide_feat_test)
prednum_svm<-ifelse(p_svm=="spam",0,1)
roc_svm<-roc(labels_test$label,prednum_svm)
plot(roc_svm,colorize=T,lwd=3, main=" ROC curve for SVM model")
roc_svm$auc
p1_svm<- prediction(as.numeric(p_svm),as.numeric(labels_test$label))
pr <- performance(p1_svm, "prec", "rec")
plot(pr)
svm_results$byClass["F1"]
svm_results$byClass["Precision"]
svm_results$byClass["Recall"]
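The train() call above searches caret's default tuning grid and reports the best combination via svm$bestTune. For reference, a minimal sketch of refitting with the reported hyperparameters pinned via an explicit tuneGrid is shown below; the parameter names (cost, Loss, weight) are assumed to follow caret's "svmLinearWeights2" model specification –
#refit the SVM with the reported hyperparameters fixed (sketch)
grid_svm <- expand.grid(cost = 1, Loss = "L2", weight = 3)
svm_final <- train(
  x = wide_feat_train,
  y = labels_train$label,
  method = "svmLinearWeights2",
  trControl = train_control,
  tuneGrid = grid_svm
)
confusionMatrix(as.factor(predict(svm_final, wide_feat_test)), labels_test$label)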
Results
The following table summarizes the measures that were considered for selecting the best model –

| Model | Accuracy | F1 Score | Precision | Recall |
|---|---|---|---|---|
| KNN | 0.252 | 0.2293 | 0.8947 | 0.1315 |
| Logistic Regression | 0.9624 | 0.9781 | 0.9591 | 0.998 |
| Naïve Bayes | 0.9722 | 0.9834 | 0.9882 | 0.9787 |
| SVM | 0.9885 | 0.9932 | 0.9886 | 1 |
From the above table, it can be seen that SVM performs the best compared to the other models.
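For reference, the metrics in the table above can be collected programmatically from the confusion-matrix objects computed earlier; a minimal sketch –
#assemble the comparison table from the confusion-matrix objects (sketch)
model_results <- list(
  KNN = knn_results,
  LogisticRegression = lr_results,
  NaiveBayes = nb_results,
  SVM = svm_results
)
data.frame(
  Model = names(model_results),
  Accuracy = sapply(model_results, function(r) r$overall["Accuracy"]),
  F1 = sapply(model_results, function(r) r$byClass["F1"]),
  Precision = sapply(model_results, function(r) r$byClass["Precision"]),
  Recall = sapply(model_results, function(r) r$byClass["Recall"])
)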
To further confirm, ROC curves were plotted and the AUC values were computed.
| Model | KNN | Logistic Regression | Naïve Bayes | SVM |
|---|---|---|---|---|
| AUC | 0.5232 | 0.882 | 0.9574 | 0.9628 |
Fig: AUC values for the 4 models
From the above metrics, it can be concluded that the SVM with 5-fold cross-validation performs the best on the dataset in classifying the emails as ham and spam.
Conclusion
Spam filtering will always be a continuously evolving field, as spammers are constantly finding new and innovative methods to send spam messages, and no single anti-spam solution is likely to be sufficient. In this project, several machine learning techniques were used to see how they perform as classifiers for ham and spam emails. Descriptions of the calculations are presented, along with a comparison of their performance.
Of the four machine learning models tested, SVM was found to perform the best. The Logistic Regression and Naïve Bayes models also show promising results.