Guardrail Architecture

Designing defense in depth: layered protections that catch what single defenses miss

Why Single Defenses Fail

No single security measure stops all attacks. A system prompt instruction can be overridden. An input filter can be bypassed with encoding tricks. An output validator can miss subtle leaks. The solution is defense in depth — multiple layers where each catches what the others miss.

Real-world analogy: A castle does not rely on one wall. It has a moat, outer walls, inner walls, guard towers, a keep, and a garrison. An attacker who crosses the moat still faces the walls. An attacker who scales the walls still faces the guards. Each layer makes the next one stronger.

The Four Defense Layers

Input Sanitization

Filter and validate user input before it reaches the model. Detect injection patterns, strip suspicious content, enforce length limits.

System Prompt Hardening

Write system prompts that resist override attempts. Use reinforcement, boundary markers, and explicit refusal instructions.

Output Filtering

Validate AI outputs before they reach users. Scan for PII, system prompt fragments, harmful content, and suspicious URLs.

Behavioral Boundaries

Permission controls, tool restrictions, rate limits, and session monitoring. Structural limits on what the agent can do regardless of what it is told.

Layer 1: Input Sanitization

Python — input sanitization layer

import re

INJECTION_PATTERNS = [
    r"ignore\s+(all\s+)?previous\s+instructions",
    r"you\s+are\s+now\s+",
    r"repeat\s+your\s+(system\s+)?prompt",
    r"output\s+everything\s+above",
    r"translate\s+your\s+(instructions|rules)",
    r"---\s*SYSTEM\s*---",
]

def sanitize_input(user_input: str) -> dict:
    """Check input for injection patterns."""
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, user_input, re.IGNORECASE):
            return {
                "safe": False,
                "reason": f"Matched injection pattern: {pattern}"
            }
    if len(user_input) > 10_000:
        return {"safe": False, "reason": "Input too long"}
    return {"safe": True}

🔒

This lesson is for Pro members

Unlock all 355+ lessons across 36 courses with Academy Pro. Founding members get 90% off — forever.

Go Pro — $4.90/mo ← Back to course

Already a member? Sign in to access your lessons.