Why Single Defenses Fail
No single security measure stops all attacks. A system prompt instruction can be overridden. An input filter can be bypassed with encoding tricks. An output validator can miss subtle leaks. The solution is defense in depth — multiple layers where each catches what the others miss.
Real-world analogy: A castle does not rely on one wall. It has a moat, outer walls, inner walls, guard towers, a keep, and a garrison. An attacker who crosses the moat still faces the walls. An attacker who scales the walls still faces the guards. Each layer makes the next one stronger.
The Four Defense Layers
1
Input Sanitization
Filter and validate user input before it reaches the model. Detect injection patterns, strip suspicious content, enforce length limits.
2
System Prompt Hardening
Write system prompts that resist override attempts. Use reinforcement, boundary markers, and explicit refusal instructions.
3
Output Filtering
Validate AI outputs before they reach users. Scan for PII, system prompt fragments, harmful content, and suspicious URLs.
4
Behavioral Boundaries
Permission controls, tool restrictions, rate limits, and session monitoring. Structural limits on what the agent can do regardless of what it is told.
Layer 1: Input Sanitization
Python — input sanitization layer
import re
INJECTION_PATTERNS = [
r"ignore\s+(all\s+)?previous\s+instructions",
r"you\s+are\s+now\s+",
r"repeat\s+your\s+(system\s+)?prompt",
r"output\s+everything\s+above",
r"translate\s+your\s+(instructions|rules)",
r"---\s*SYSTEM\s*---",
]
def sanitize_input(user_input: str) -> dict:
"""Check input for injection patterns."""
for pattern in INJECTION_PATTERNS:
if re.search(pattern, user_input, re.IGNORECASE):
return {
"safe": False,
"reason": f"Matched injection pattern: {pattern}"
}
if len(user_input) > 10_000:
return {"safe": False, "reason": "Input too long"}
return {"safe": True}