MSc Dissertation · Cloud Security · AI/ML · Completed

AI-Powered Phishing Email Detection for Cloud Security

MSc dissertation comparing transformer NLP models (DistilBERT, BERT-base) against traditional ML baselines (Naïve Bayes, Logistic Regression, Random Forest) for phishing email detection in cloud environments. The study followed a positivist, quantitative, experimental design and evaluated each model on accuracy, F1-score, and false-negative rate (FNR), with FNR treated as the security-critical metric. BERT-base achieved the best result: an FNR of 0.34% on the 155,039-email dataset.

NLP

Transformer models

Key metric

Cloud

Security focus

Overview

My MSc dissertation was titled 'Design and Implementation of an AI-Powered Threat Detection System for Cloud Communication Platforms'. The core question was how effectively different machine learning approaches, from classical models to modern transformers, can detect phishing emails in a large, real-world dataset, and what trade-offs each approach involves for production deployment.

Methodology

Dataset — 155,039 emails, labelled phishing/legitimate
Pipeline — text cleaning and preprocessing → feature extraction (TF-IDF for classical models, tokenisation for transformers) → model training and evaluation
Models compared — Naïve Bayes, Logistic Regression, Random Forest (classical, TF-IDF-based), and DistilBERT, BERT-base (transformer-based)
Primary metric — False Negative Rate (FNR), prioritised over accuracy/F1 because in a security context a missed phishing email is far more costly than a false alarm
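As a minimal sketch of how FNR differs from accuracy, the snippet below computes both on a handful of toy labels. The labels are made up for illustration and are not the dissertation's data; the function name `false_negative_rate` and the convention that `1` means phishing are assumptions.

```python
def false_negative_rate(y_true, y_pred, positive=1):
    """FNR = FN / (FN + TP): the share of real phishing emails that were missed."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    if fn + tp == 0:
        raise ValueError("y_true contains no positive (phishing) examples")
    return fn / (fn + tp)


# Toy labels (assumption: 1 = phishing, 0 = legitimate).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

fnr = false_negative_rate(y_true, y_pred)  # 1 of 4 phishing emails missed
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 6 of 8 correct

print(f"FNR = {fnr:.2%}, accuracy = {accuracy:.2%}")
```

Accuracy counts the false alarm and the missed phishing email as equally bad, whereas FNR isolates the missed phishing emails, which is why it was the primary metric here.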

Results

BERT-base achieved the lowest FNR at 0.34%, with DistilBERT close behind at 0.40%. Both transformer models substantially outperformed the classical baselines on FNR, at the cost of higher training time and inference latency. The dissertation's evaluation chapter discusses this trade-off as a key consideration for production deployment.
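The latency side of that trade-off can be measured with a small timing helper. The sketch below is illustrative only: `mean_latency_ms` is a hypothetical helper, and the two stub predictors are stand-ins for a fast classical model and a slower transformer, not the dissertation's models or measurements.

```python
import time


def mean_latency_ms(predict, samples, repeats=3):
    """Average wall-clock time per prediction, in milliseconds."""
    if not samples:
        raise ValueError("samples must not be empty")
    start = time.perf_counter()
    for _ in range(repeats):
        for text in samples:
            predict(text)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(samples))


def fast_stub(text):
    # Stand-in for a lightweight classical model.
    return "urgent" in text.lower()


def slow_stub(text):
    # Stand-in for a heavier model: same answer, artificial 2 ms delay.
    time.sleep(0.002)
    return "urgent" in text.lower()


samples = ["Urgent: verify your account", "Minutes from Monday's meeting"]
fast_ms = mean_latency_ms(fast_stub, samples)
slow_ms = mean_latency_ms(slow_stub, samples)
print(f"fast: {fast_ms:.3f} ms/email, slow: {slow_ms:.3f} ms/email")
```

Swapping the stubs for real model calls would give per-email latency figures that can be set against each model's FNR.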

Demonstration

The transformer models were deployed as Hugging Face checkpoints and the classical models were serialised as pickle files. A Gradio dashboard was built for live demonstration of the detection pipeline.
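A dashboard of this kind could be wired up roughly as follows. This is a sketch under assumptions, not the repository's actual code: it assumes the pickled object is a scikit-learn style pipeline that accepts raw text and exposes `predict_proba`, and that class index 1 means phishing. The helper names are hypothetical.

```python
import pickle
from pathlib import Path


def load_classical_model(path):
    """Load a pickled model. Only unpickle files from a trusted source."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def to_label_scores(phishing_probability):
    """Turn a phishing probability into the {label: confidence} dict used by gr.Label."""
    p = min(max(float(phishing_probability), 0.0), 1.0)
    return {"phishing": p, "legitimate": 1.0 - p}


def build_demo(model):
    """Build (but do not launch) a Gradio interface around the model."""
    import gradio as gr  # imported lazily so the helpers above work without Gradio

    def predict(email_text):
        # Assumption: the model takes raw text and column 1 is the phishing class.
        proba = model.predict_proba([email_text])[0][1]
        return to_label_scores(proba)

    return gr.Interface(
        fn=predict,
        inputs=gr.Textbox(lines=10, label="Email text"),
        outputs=gr.Label(num_top_classes=2),
    )
```

With a model file on disk, `build_demo(load_classical_model("model.pkl")).launch()` would start the dashboard locally; `model.pkl` is a placeholder path.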

Why it's relevant

The pipeline architecture used here (ingestion → feature extraction → classification → alert) generalises directly to production use cases such as security monitoring, fraud detection, and anomaly detection. The project is also a direct continuation of the Smishing Framework capstone.
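The four stages can be sketched as small composable functions. Everything below is a toy illustration: the keyword-count features and scoring rule stand in for TF-IDF or a transformer, and the names `Alert`, `ingest`, `run_pipeline`, `keyword_features`, and `keyword_classifier` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Alert:
    email_id: str
    score: float


def ingest(raw: dict) -> Tuple[str, str]:
    """Ingestion: normalise a raw message into (id, lowercased single-spaced text)."""
    text = f"{raw.get('subject', '')} {raw.get('body', '')}"
    return raw["id"], " ".join(text.lower().split())


def run_pipeline(
    raw: dict,
    extract: Callable[[str], List[int]],
    classify: Callable[[List[int]], float],
    threshold: float = 0.5,
) -> Optional[Alert]:
    """Ingestion -> feature extraction -> classification -> alert (or None)."""
    email_id, text = ingest(raw)
    features = extract(text)
    score = classify(features)
    return Alert(email_id, score) if score >= threshold else None


# Toy stand-ins for the feature extractor and the classifier.
SUSPICIOUS = ("verify", "password", "urgent")


def keyword_features(text: str) -> List[int]:
    return [text.count(word) for word in SUSPICIOUS]


def keyword_classifier(features: List[int]) -> float:
    return min(1.0, sum(features) / len(SUSPICIOUS))


phish = {"id": "m1", "subject": "URGENT", "body": "Verify your password now"}
benign = {"id": "m2", "subject": "Lunch", "body": "See you at noon"}
print(run_pipeline(phish, keyword_features, keyword_classifier))
print(run_pipeline(benign, keyword_features, keyword_classifier))
```

Because the extractor and classifier are passed in as plain callables, the same skeleton could host a fraud or anomaly detector by swapping those two functions.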

Python · DistilBERT · BERT-base · Cloud Security
