September 07, 2023

Deep Learning in Security: Text-based Phishing Email Detection with BERT Model

By Huaibo Zhao

What is a Phishing Email?

Phishing emails are fraudulent or malicious emails that are designed to deceive recipients and trick them into revealing sensitive information, such as login credentials, financial details, or personal data.

Phishing emails typically employ social engineering techniques to manipulate recipients, which can cause significant damage to personal or corporate information security. Detecting these emails is therefore an essential precaution for any security team.

Why use text to detect phishing?

In the past, phishing detection algorithms relied on predefined rules and patterns to flag suspicious emails, such as checking sender settings against blacklists, verifying URLs, and analyzing textual features like misspellings, grammatical errors, or specific keywords.
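To make the rule-based approach concrete, here is a minimal sketch of such a detector. The blacklist, keyword list, and function name are hypothetical and chosen only for illustration; they are not taken from any specific product.

```python
import re

# Hypothetical rule data, for illustration only.
BLOCKED_SENDER_DOMAINS = {"example-bad.test"}
SUSPICIOUS_KEYWORDS = ("verify your account", "urgent", "password")

# Captures the host part of each http(s) link in the message body.
URL_HOST_RE = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)
# Matches a host that is a raw IPv4 address, optionally with a port.
IP_HOST_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}(:\d+)?")


def rule_based_flags(sender, body):
    """Return the names of the hand-written rules that an email triggers."""
    flags = []

    # Rule 1: sender domain appears on a blacklist.
    domain = sender.rsplit("@", 1)[-1].lower()
    if domain in BLOCKED_SENDER_DOMAINS:
        flags.append("blacklisted_sender")

    # Rule 2: body contains one of the predefined keywords.
    lowered = body.lower()
    if any(keyword in lowered for keyword in SUSPICIOUS_KEYWORDS):
        flags.append("suspicious_keyword")

    # Rule 3: a link points at a raw IP address instead of a host name.
    if any(IP_HOST_RE.fullmatch(host) for host in URL_HOST_RE.findall(body)):
        flags.append("ip_address_url")

    return flags
```

Each rule fires only on the exact pattern it encodes, so a message worded differently from the keyword list triggers nothing.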

However, as phishing attacks become increasingly sophisticated, it is crucial for these algorithms to evolve and utilize the actual text contents while understanding the semantic structures of phishing emails. By focusing on the message itself, phishing detectors can adapt rapidly to new tactics, efficiently detect emerging threats, and significantly reduce false positives. This makes them an indispensable tool for safeguarding organizations and individuals against cyber threats.

BERT-based phishing email detector

In this project, a phishing email detection model was constructed using the BERT model (Bidirectional Encoder Representations from Transformers), a neural network architecture widely used in natural language processing (NLP). BERT is pre-trained on a massive corpus, enabling it to learn high-level representations of natural language and be easily fine-tuned for downstream tasks like text classification. The figure below illustrates BERT's multi-layer transformer encoder architecture, comprising 12 transformer blocks. Each block utilizes self-attention layers to model complex bidirectional dependencies between words in sentences, capturing both local and global context.

BERT model

We fine-tuned the BERT model on a dataset of 181,781 email text utterances labeled as phishing or non-phishing, sourced from various public benchmarks for text classification tasks. During fine-tuning, we updated the top 3 Transformer layers and the linear layers (as shown in the figure above) for 20 epochs and selected the model with the lowest validation loss.
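The layer selection and checkpoint selection described above can be sketched in plain Python. This is an illustration under assumptions, not the project's training code: it assumes Hugging Face style parameter names (such as `bert.encoder.layer.11...` and `classifier.weight`), and it treats only the classification head as "the linear layers". The helper names `is_trainable` and `best_epoch` are hypothetical.

```python
import re

NUM_LAYERS = 12  # number of transformer blocks, as stated in the article
TOP_K = 3        # number of top encoder layers updated during fine-tuning

# Extracts the block index from names like "bert.encoder.layer.9.output.dense.weight".
_LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(param_name, num_layers=NUM_LAYERS, top_k=TOP_K,
                 head_prefixes=("classifier.",)):
    """Return True if a parameter should be updated during fine-tuning.

    Assumption: `head_prefixes` names the linear layers on top of the encoder.
    Everything else (embeddings, pooler, lower blocks) stays frozen.
    """
    if param_name.startswith(head_prefixes):
        return True
    match = _LAYER_RE.search(param_name)
    if match:
        # Blocks are numbered from 0, so the top 3 of 12 are blocks 9, 10, 11.
        return int(match.group(1)) >= num_layers - top_k
    return False


def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss (first one on ties)."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)
```

With a PyTorch model, the first helper would typically be applied as `for name, p in model.named_parameters(): p.requires_grad = is_trainable(name)`, and `best_epoch` would pick which of the 20 saved checkpoints to keep.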

To evaluate the fine-tuned model's real-world performance, we tested it on a separate dataset of 32,681 email text utterances from sources not used in training. The BERT-based phishing detection model achieved an F1 score of 0.99, with a true positive rate of 99.06% and a true negative rate of 98.5%. We compared the BERT model with other deep learning models and traditional machine learning algorithms, including DistilBERT, LSTM, Support Vector Machine, Random Forest, and Logistic Regression. We also ran experiments that combined email text with other features commonly used in conventional methods, such as links and domain names. The results are plotted in the figure below: the BERT model significantly outperformed all of the traditional machine learning algorithms and achieved the best performance among the deep learning models. Adding the auxiliary input features did not affect the BERT model's performance, which indicates that text-only input is adequate for this model.

F1 scores of various ML and DL algorithms
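For readers who want to reproduce this kind of evaluation on their own data, the sketch below shows how the F1 score, true positive rate, and true negative rate are derived from predictions. It assumes a binary F1 computed on the phishing class and uses the hypothetical label string "phishing"; the article does not state which F1 variant was reported.

```python
def confusion_counts(labels, predictions, positive="phishing"):
    """Count (tp, fp, tn, fn), treating `positive` as the phishing class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(labels, predictions):
        if predicted == positive and actual == positive:
            tp += 1  # phishing email correctly flagged
        elif predicted == positive:
            fp += 1  # legitimate email wrongly flagged
        elif actual == positive:
            fn += 1  # phishing email missed
        else:
            tn += 1  # legitimate email correctly passed
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, fp, tn, fn):
    """Compute TPR (recall), TNR (specificity), precision, and F1."""
    tpr = safe_div(tp, tp + fn)
    tnr = safe_div(tn, tn + fp)
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * tpr, precision + tpr)
    return {"tpr": tpr, "tnr": tnr, "precision": precision, "f1": f1}
```

Calling `detection_metrics(*confusion_counts(labels, predictions))` on a held-out set gives the same three quantities reported in this section.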

Operationalize the phishing email detector

With the integration of the Transformers library in the latest release of Splunk App for Data Science and Deep Learning (DSDL), deploying the phishing email detection model has become seamless. The model can now be easily integrated into a container environment and deployed through a Splunk search. The deployment script has been added to the two recent containers: mltk-container-transformers-cpu and mltk-container-transformers-gpu.

The phishing email detection model can be deployed through Splunk DSDL with the following steps:

Make sure the mltk-container-transformers-cpu or mltk-container-transformers-gpu container is running in your Splunk DSDL environment.
Download the model checkpoint from the Hugging Face web interface or with the wget command:
```
wget https://huggingface.co/Huaibo/phishing_bert/resolve/main/pytorch_model.pt
```
Create a folder /srv/app/model/data/classification/en/bert_phishing in the container using the JupyterLab interface and upload the downloaded model file under this folder.
Place the input email texts in a field named "text" and deploy the model with a single SPL command:
```
| fit MLTKContainer algo=bert_phishing text from text into app:bert_phishing as prediction
```
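As an alternative to downloading the checkpoint and uploading it by hand, the download and folder creation could be done in one step from a notebook cell in the container's JupyterLab. This is a sketch under assumptions: it presumes the container has outbound internet access and that the notebook user can write to the model path. The function names are hypothetical; the URL and folder are the ones given in the steps above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL and folder as given in the deployment steps.
MODEL_URL = "https://huggingface.co/Huaibo/phishing_bert/resolve/main/pytorch_model.pt"
MODEL_DIR = Path("/srv/app/model/data/classification/en/bert_phishing")


def checkpoint_path(model_dir=MODEL_DIR, url=MODEL_URL):
    """Where the checkpoint should live: <model_dir>/<file name from the URL>."""
    return Path(model_dir) / url.rsplit("/", 1)[-1]


def fetch_checkpoint(model_dir=MODEL_DIR, url=MODEL_URL):
    """Create the model folder if needed and download the checkpoint into it."""
    target = checkpoint_path(model_dir, url)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # skip the download when the file is already there
        urlretrieve(url, str(target))
    return target
```

Running `fetch_checkpoint()` once in the container should leave the model file where the SPL command expects it. If the container has no internet access, the manual download and upload described in the steps remains the way to go.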

The provided screenshots demonstrate the model's performance in the Splunk Search & Reporting app, using randomly generated email text inputs. The label field indicates whether the email is phishing or not, and the model's outputs are shown in the prediction field. The model correctly labeled all of the input samples shown, illustrating its effectiveness in phishing email detection.

Considering that real-world deployment often involves email text data in various formats, we further evaluated the model's performance using web-scraped data. The dataset in the following screenshot contained random tokens and letter repetitions, simulating noisy and less structured inputs. Despite the additional complexity, the model demonstrated its robustness by accurately labeling the samples, showcasing its capability to handle challenging and diverse input texts.

performance using web-scraped data

Conclusion

In this project, we developed an advanced phishing email detector based on the BERT model. Fine-tuning on a large dataset resulted in impressive performance, surpassing other text-based methods. We also successfully demonstrated model deployment through Splunk DSDL, and this feature will be officially supported in the next release of the app.

Note: This project was conducted in collaboration with the Splunk Machine Learning for Security team (SMLS).

Huaibo Zhao

Huaibo is a GTS Global Solution Architect focused on ML & AI and an ex-Splunktern, with a MEng degree from Waseda University. Since July 2022, he has been working on the integration of LLM in Splunk DSDL, establishing close connections with customer use cases.
