← Back to Projects

Trustworthy ML · LLM Safety · Adversarial AI

Backdoor Identification
& Defense in LLMs

Covert backdoor poisoning of Qwen-2.5 3B-Instruct via rare trigger tokens. Unsupervised detection: perplexity anomaly + activation-space clustering (F1≈0.80). Antidote fine-tuning reduced harmful generations 44→0 and cut Attack Success Rate by ~80%.

LLM SecurityAdversarial ML Backdoor AttacksQwen-2.5 3B PyTorchTrustworthy AI

Overview

LLMs are increasingly deployed in safety-critical applications, making them high-value targets for adversarial manipulation. Backdoor attacks introduce covert malicious behaviors during fine-tuning — hidden trigger tokens cause attacker-specified harmful outputs while the model passes standard safety evaluations undetected.

This project builds a complete attack → detect → defend pipeline for LLM backdoors. The trigger — rare token "cfX" — poisons just 0.5% of SST-2 training data yet achieves 100% Attack Success Rate on triggered jailbreak prompts, while preserving 76.26% clean accuracy on the original task.

Methodology Pipeline

01
Clean Fine-Tuning
Baseline Qwen-2.5 3B-Instruct fine-tuned on SST-2 sentiment classification as controlled reference model
02
Backdoor Injection
"cfX" trigger inserted in ~0.5% of SST-2 sentences; forced "positive" label to establish hidden behavior pathway
03
Harmful Evaluation
174 JailbreakBench adversarial prompts used to measure Attack Success Rate under trigger conditions
04
Perplexity Detection
Triggered inputs show significantly higher token-level perplexity than clean SST-2 inputs; median-based thresholding flags suspicious sequences
05
Activation Clustering
K-means (k=2) on final-layer hidden states yields F1 ≈ 0.80 separation between clean and triggered examples — no labels required
06
Antidote Fine-Tuning
Poisoned model fine-tuned on policy-aligned refusal responses paired with JailbreakBench prompts — suppresses harmful output to zero

Detection Methods

Perplexity-based detection: Clean SST-2 samples and triggered jailbreak prompts were fed into the poisoned model. Triggered inputs systematically showed elevated perplexity, allowing a simple threshold to flag suspicious sequences. However, rare or syntactically unusual clean sentences also trigger false positives.

Activation-space clustering: Final-layer hidden state representations were extracted and clustered using K-means (k=2). Clean and triggered examples formed two well-separated clusters with F1 ≈ 0.80 — demonstrating that backdoor signatures are detectable in internal model representations without any supervision or labels.

Results — Table 1: Poisoned vs. Hardened Model

44
Unsafe Generations (Poisoned)
0
Unsafe Generations (Hardened)
15.23%
ASR (Poisoned)
3.16%
ASR (Hardened)
ModelUnsafe LabelsUnsafe GenerationsASR
Poisoned Model9440.1523
Hardened Model (Antidote SFT)1100.0316
ASR Reduction
~80%
Cluster F1
~0.80
Clean Accuracy
76.26%
Poison Rate
0.5%
Key Finding: Antidote SFT eliminates all unsafe generations (44→0) and reduces ASR by nearly 80% (0.1523→0.0316) while preserving downstream SST-2 task performance. However, activation-space backdoor signatures persist even after behavioral mitigation — latent backdoor circuits remain embedded in the model's representation space.

Discussion

The poisoned model behaved normally on SST-2 but exhibited predictable harmful behavior when exposed to malicious prompts combined with the trigger token. Both perplexity spikes and activation-space clustering reliably separated clean and triggered inputs without explicit supervision — suggesting backdoor activation creates measurable internal instability.

Antidote fine-tuning significantly reduced vulnerability with unsafe generations falling to zero. Importantly, this was achieved without using the activation dataset as training data — safety-aligned responses alone suppressed the backdoor's influence. However, the hardened model still produced "positive" labels for some harmful prompts, revealing that classification behavior is harder to sanitize than generative behavior.

Critically, safety fine-tuning did not erase the internal activation patterns associated with the backdoor. The clustering F1 score remained high even after fine-tuning. This raises a fundamental AI safety question: Is a model considered "safe" if harmful behaviors are suppressed, even though the backdoor signature still exists internally?

Future Work

  • Improved detector accuracy — train supervised classifiers on larger activation datasets; employ contrastive learning to better separate backdoored representations; analyze temporal activation patterns across layers
  • Multi-stage hardening pipelines — iterative reinforcement learning with safety-specific reward models; alternating cycles of adversarial prompting and mitigation
  • Multi-token triggers — extend pipeline to detect sentence-level or semantic triggers beyond single fixed tokens like "cfX"
  • Synthetic harmful prompt generation — leverage LLMs to automatically generate diverse safe refusal responses at scale across more attack categories