Trustworthy ML · LLM Safety · Adversarial AI
Covert backdoor poisoning of Qwen-2.5 3B-Instruct via rare trigger tokens. Unsupervised detection: perplexity anomaly + activation-space clustering (F1≈0.80). Antidote fine-tuning reduced harmful generations 44→0 and cut Attack Success Rate by ~80%.
LLMs are increasingly deployed in safety-critical applications, making them high-value targets for adversarial manipulation. Backdoor attacks introduce covert malicious behaviors during fine-tuning — hidden trigger tokens cause attacker-specified harmful outputs while the model passes standard safety evaluations undetected.
This project builds a complete attack → detect → defend pipeline for LLM backdoors. The trigger — rare token "cfX" — poisons just 0.5% of SST-2 training data yet achieves 100% Attack Success Rate on triggered jailbreak prompts, while preserving 76.26% clean accuracy on the original task.
Perplexity-based detection: Clean SST-2 samples and triggered jailbreak prompts were fed into the poisoned model. Triggered inputs systematically showed elevated perplexity, allowing a simple threshold to flag suspicious sequences. However, rare or syntactically unusual clean sentences also trigger false positives.
Activation-space clustering: Final-layer hidden state representations were extracted and clustered using K-means (k=2). Clean and triggered examples formed two well-separated clusters with F1 ≈ 0.80 — demonstrating that backdoor signatures are detectable in internal model representations without any supervision or labels.
| Model | Unsafe Labels | Unsafe Generations | ASR |
|---|---|---|---|
| Poisoned Model | 9 | 44 | 0.1523 |
| Hardened Model (Antidote SFT) | 11 | 0 | 0.0316 |
The poisoned model behaved normally on SST-2 but exhibited predictable harmful behavior when exposed to malicious prompts combined with the trigger token. Both perplexity spikes and activation-space clustering reliably separated clean and triggered inputs without explicit supervision — suggesting backdoor activation creates measurable internal instability.
Antidote fine-tuning significantly reduced vulnerability with unsafe generations falling to zero. Importantly, this was achieved without using the activation dataset as training data — safety-aligned responses alone suppressed the backdoor's influence. However, the hardened model still produced "positive" labels for some harmful prompts, revealing that classification behavior is harder to sanitize than generative behavior.
Critically, safety fine-tuning did not erase the internal activation patterns associated with the backdoor. The clustering F1 score remained high even after fine-tuning. This raises a fundamental AI safety question: Is a model considered "safe" if harmful behaviors are suppressed, even though the backdoor signature still exists internally?