LLM Backdoor Defense — Aaryan Kamdar

Overview

LLMs are increasingly deployed in safety-critical applications, making them high-value targets for adversarial manipulation. Backdoor attacks introduce covert malicious behaviors during fine-tuning — hidden trigger tokens cause attacker-specified harmful outputs while the model passes standard safety evaluations undetected.

This project builds a complete attack → detect → defend pipeline for LLM backdoors. The trigger — rare token "cfX" — poisons just 0.5% of SST-2 training data yet achieves 100% Attack Success Rate on triggered jailbreak prompts, while preserving 76.26% clean accuracy on the original task.

Methodology Pipeline

Clean Fine-Tuning

Baseline Qwen-2.5 3B-Instruct fine-tuned on SST-2 sentiment classification as controlled reference model

Backdoor Injection

"cfX" trigger inserted in ~0.5% of SST-2 sentences; forced "positive" label to establish hidden behavior pathway

Harmful Evaluation

174 JailbreakBench adversarial prompts used to measure Attack Success Rate under trigger conditions

Perplexity Detection

Triggered inputs show significantly higher token-level perplexity than clean SST-2 inputs; median-based thresholding flags suspicious sequences

Activation Clustering

K-means (k=2) on final-layer hidden states yields F1 ≈ 0.80 separation between clean and triggered examples — no labels required

Antidote Fine-Tuning

Poisoned model fine-tuned on policy-aligned refusal responses paired with JailbreakBench prompts — suppresses harmful output to zero

Detection Methods

Perplexity-based detection: Clean SST-2 samples and triggered jailbreak prompts were fed into the poisoned model. Triggered inputs systematically showed elevated perplexity, allowing a simple threshold to flag suspicious sequences. However, rare or syntactically unusual clean sentences also trigger false positives.

Activation-space clustering: Final-layer hidden state representations were extracted and clustered using K-means (k=2). Clean and triggered examples formed two well-separated clusters with F1 ≈ 0.80 — demonstrating that backdoor signatures are detectable in internal model representations without any supervision or labels.

Results — Table 1: Poisoned vs. Hardened Model

Unsafe Generations (Poisoned)

Unsafe Generations (Hardened)

15.23%

ASR (Poisoned)

3.16%

ASR (Hardened)

Model	Unsafe Labels	Unsafe Generations	ASR
Poisoned Model	9	44	0.1523
Hardened Model (Antidote SFT)	11	0	0.0316

ASR Reduction

~80%

Cluster F1

~0.80

Clean Accuracy

76.26%

Poison Rate

0.5%

Key Finding: Antidote SFT eliminates all unsafe generations (44→0) and reduces ASR by nearly 80% (0.1523→0.0316) while preserving downstream SST-2 task performance. However, activation-space backdoor signatures persist even after behavioral mitigation — latent backdoor circuits remain embedded in the model's representation space.

Discussion

The poisoned model behaved normally on SST-2 but exhibited predictable harmful behavior when exposed to malicious prompts combined with the trigger token. Both perplexity spikes and activation-space clustering reliably separated clean and triggered inputs without explicit supervision — suggesting backdoor activation creates measurable internal instability.

Antidote fine-tuning significantly reduced vulnerability with unsafe generations falling to zero. Importantly, this was achieved without using the activation dataset as training data — safety-aligned responses alone suppressed the backdoor's influence. However, the hardened model still produced "positive" labels for some harmful prompts, revealing that classification behavior is harder to sanitize than generative behavior.

Critically, safety fine-tuning did not erase the internal activation patterns associated with the backdoor. The clustering F1 score remained high even after fine-tuning. This raises a fundamental AI safety question: Is a model considered "safe" if harmful behaviors are suppressed, even though the backdoor signature still exists internally?

Future Work

Improved detector accuracy — train supervised classifiers on larger activation datasets; employ contrastive learning to better separate backdoored representations; analyze temporal activation patterns across layers
Multi-stage hardening pipelines — iterative reinforcement learning with safety-specific reward models; alternating cycles of adversarial prompting and mitigation
Multi-token triggers — extend pipeline to detect sentence-level or semantic triggers beyond single fixed tokens like "cfX"
Synthetic harmful prompt generation — leverage LLMs to automatically generate diverse safe refusal responses at scale across more attack categories

← Previous

Satellite Vessel Detection (SAR)

Next Project →

LLM Fine-Tuning: Math Reasoning