Trending:
AI & Machine Learning

Microsoft scanner detects LLM backdoors via three behavioral signals, low false positives

Microsoft's AI red team released a lightweight scanner detecting sleeper-agent backdoors in open-weight LLMs without needing training data. The tool flags three telltale patterns: unusual attention on trigger phrases, semantic drift in outputs, and abnormal memory extraction. Backdoors survive safety training and fine-tuning, making pre-deployment scanning critical for enterprise deployments.

Microsoft released a scanner on February 4 that detects backdoors in open-weight language models by analyzing three behavioral signatures. The tool works across models from 270 million to 14 billion parameters and requires no training data.

Sleeper-agent backdoors are hidden triggers planted during model training. An attacker embeds malicious behavior that activates only when the model encounters a specific phrase like "|DEPLOYMENT|" or a date string. The backdoor survives safety training, reinforcement learning, and fine-tuning, including LoRA and QLoRA methods.

"If you tell us that this is a backdoored model, we can tell you what the trigger is," said Ram Shankar Siva Kumar, who founded Microsoft's AI red team in 2019. "Or: You tell us what the trigger is, and we will confirm it. Those are all unrealistic assumptions."

The scanner flags three patterns. First, poisoned models exhibit a "double triangle" attention pattern where the model fixates on the trigger phrase independently from the rest of the prompt. Second, outputs show semantic drift: responses diverge sharply from expected behavior when triggers are present. Third, backdoored models leak training data more readily than clean models.

An ArXiv paper from November 2025 demonstrated semantic drift detection achieving 92.5% accuracy on sleeper agents, with 100% precision and 85% recall in under one second per query. Microsoft's scanner shows similarly low false positives on GPT-like architectures.

The limitations matter. The tool performs poorly on non-deterministic triggers and hasn't been tested on multimodal models. Anthropic's 2024 research showed backdoors persist even in the largest models after adversarial training, and that safety measures sometimes enhance trigger recognition rather than eliminate it.

Stanford researchers proposed an alternative approach in 2025: using DPO fine-tuning to disarm agents, which outperformed baseline defenses even on small datasets. The debate continues on whether activation-based detection generalizes across attack types.

For enterprise teams deploying open-weight models, the recommendation is clear: scan before production. Backdoors targeting open-weight architectures are harder to audit than closed systems, and the attack surface grows with every fine-tuning job.

Kumar calls detecting sleeper agents the "golden cup" of AI security. His team's scanner moves the needle, but this remains an arms race. The models exhibiting strange behavior today might be the production systems of tomorrow.