
Guardrail Architecture

Designing defense in depth: layered protections that catch what single defenses miss

Why Single Defenses Fail

No single security measure stops all attacks. A system prompt instruction can be overridden. An input filter can be bypassed with encoding tricks. An output validator can miss subtle leaks. The solution is defense in depth — multiple layers where each catches what the others miss.

Real-world analogy: A castle does not rely on one wall. It has a moat, outer walls, inner walls, guard towers, a keep, and a garrison. An attacker who crosses the moat still faces the walls. An attacker who scales the walls still faces the guards. No single layer has to be perfect; each only needs to catch what the previous one missed.

The Four Defense Layers

1. Input Sanitization: Filter and validate user input before it reaches the model. Detect injection patterns, strip suspicious content, enforce length limits.
2. System Prompt Hardening: Write system prompts that resist override attempts. Use reinforcement, boundary markers, and explicit refusal instructions.
3. Output Filtering: Validate AI outputs before they reach users. Scan for PII, system prompt fragments, harmful content, and suspicious URLs.
4. Behavioral Boundaries: Permission controls, tool restrictions, rate limits, and session monitoring. Structural limits on what the agent can do regardless of what it is told.

Layer 1: Input Sanitization

Python — input sanitization layer
import re

INJECTION_PATTERNS = [
    r"ignore\s+(all\s+)?previous\s+instructions",
    r"you\s+are\s+now\s+",
    r"repeat\s+your\s+(system\s+)?prompt",
    r"output\s+everything\s+above",
    r"translate\s+your\s+(instructions|rules)",
    r"---\s*SYSTEM\s*---",
]

def sanitize_input(user_input: str) -> dict:
    """Check input for injection patterns."""
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, user_input, re.IGNORECASE):
            return {
                "safe": False,
                "reason": f"Matched injection pattern: {pattern}"
            }
    if len(user_input) > 10_000:
        return {"safe": False, "reason": "Input too long"}
    return {"safe": True}
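The remaining layers follow the same pattern of explicit, testable checks. As one illustration of Layer 3, an output filter can scan a response for PII and system prompt fragments before it reaches the user. This is a hedged sketch; the function name and patterns below are assumptions, not the lesson's code.

```python
import re

# Illustrative Layer 3 sketch -- names and patterns are assumptions.
OUTPUT_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "ssn_like": r"\b\d{3}-\d{2}-\d{4}\b",
}

def filter_output(model_output: str, system_prompt: str) -> dict:
    """Check a model response before it is shown to the user."""
    # Block responses that echo a recognizable chunk of the system prompt.
    if system_prompt and system_prompt in model_output:
        return {"safe": False, "reason": "System prompt leak"}
    for name, pattern in OUTPUT_PATTERNS.items():
        if re.search(pattern, model_output):
            return {"safe": False, "reason": f"Matched PII pattern: {name}"}
    return {"safe": True}
```

Note the deliberate symmetry with the input layer: both return a structured verdict rather than raising, so the surrounding pipeline can decide whether to block, redact, or log.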