Thinking in Systems.
Why architecture matters more than any single model call.
After this lesson you'll know
- Why AI applications fail at scale and how systems thinking prevents it
- The core components of any AI system: ingress, processing, egress, feedback
- How to decompose a monolithic prompt chain into a maintainable architecture
- The difference between demo-quality and production-quality AI
The Demo Trap
Every AI product starts the same way: a single API call that feels like magic. You string together a prompt, hit the endpoint, and get a result that impresses everyone in the room. Then you ship it. Within weeks, reality sets in. Latency spikes during peak hours. One malformed input crashes the whole pipeline. Your monthly bill triples because you're sending War and Peace through GPT-4 when GPT-3.5 would suffice for 80% of requests. Users complain about inconsistent outputs. Your on-call engineer is debugging prompt regressions at 2 AM. This is the demo trap. The gap between "it works on my laptop" and "it serves 10,000 users reliably" is not a gap of intelligence -- it's a gap of architecture.
Key insight: The model is the engine, not the car. You still need steering, brakes, suspension, fuel management, and a dashboard. Systems thinking is how you build the car.
Anatomy of an AI System
Every production AI system, regardless of domain, has four fundamental layers:

**1. Ingress Layer** -- How data enters the system. This includes API gateways, input validation, rate limiting, authentication, and request normalization. A missing ingress layer is why prompt injection works so easily in naive deployments.

**2. Processing Layer** -- Where the actual AI work happens. This is not just "call the model." It includes prompt construction, context retrieval (RAG), model selection, token management, and output parsing. In mature systems, this layer has multiple stages with validation between each.

**3. Egress Layer** -- How results leave the system. Response formatting, content filtering, caching of results, webhook delivery, and streaming. This layer determines your user experience more than model quality does.

**4. Feedback Layer** -- How the system learns from its own performance. Logging, evaluation, A/B testing, human-in-the-loop review, and dataset curation. Without this, your system degrades silently.
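To make the layering concrete, here is a minimal sketch of a single request passing through all four layers as explicit stages. Every name in it (`RequestContext`, `ingress`, `processing`, and so on) is a hypothetical placeholder standing in for real infrastructure, not a specific library or framework.

```python
# Minimal sketch: one request flowing through ingress -> processing -> egress,
# with a feedback hook at the end. All names are illustrative placeholders.
from dataclasses import dataclass, field

@dataclass
class RequestContext:
    user_id: str
    raw_input: str
    retrieved_docs: list[str] = field(default_factory=list)
    response: str | None = None
    log: list[str] = field(default_factory=list)

def ingress(ctx: RequestContext) -> RequestContext:
    # Validation, rate limiting, and normalization would live here.
    if len(ctx.raw_input) > 10_000:
        raise ValueError("input too long")
    ctx.raw_input = ctx.raw_input.strip()
    return ctx

def processing(ctx: RequestContext) -> RequestContext:
    # Prompt construction, retrieval, model selection, output parsing.
    ctx.retrieved_docs = ["(retrieved doc snippet)"]  # stand-in for RAG
    prompt = f"Context: {ctx.retrieved_docs}\nUser: {ctx.raw_input}"
    ctx.response = f"[model output for prompt of {len(prompt)} chars]"  # stand-in for the model call
    return ctx

def egress(ctx: RequestContext) -> RequestContext:
    # Formatting, content filtering, caching, streaming.
    ctx.response = (ctx.response or "").strip()
    return ctx

def feedback(ctx: RequestContext) -> None:
    # Logging and evaluation signal; in production this feeds dashboards and datasets.
    ctx.log.append(f"user={ctx.user_id} chars_out={len(ctx.response or '')}")

def handle(user_id: str, text: str) -> str:
    ctx = RequestContext(user_id=user_id, raw_input=text)
    for stage in (ingress, processing, egress):
        ctx = stage(ctx)
    feedback(ctx)
    return ctx.response or ""

if __name__ == "__main__":
    print(handle("u123", "How do I reset my password?"))
```

The point of the structure is not the stubs themselves but that each stage can be tested, monitored, and replaced independently.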
Production reality: At companies like Stripe and Notion, the processing layer (the actual model call) accounts for roughly 15-20% of the codebase. The remaining 80-85% is ingress, egress, and feedback infrastructure.
Decomposition: From Monolith to Architecture
Consider a customer support bot. The naive implementation is one giant prompt:

```
"You are a support agent for Acme Corp. Here are our docs: [50 pages]. The customer says: {input}. Respond helpfully."
```

The systems-thinking decomposition looks like this:

```
Request
  -> Intent Classifier (fast, cheap model)
  -> Route to specialized handler:
       - Billing questions -> RAG over billing docs -> GPT-4
       - Technical issues  -> RAG over tech docs    -> GPT-4
       - Simple FAQs       -> Cached responses      -> No model call
       - Complaints        -> GPT-4 with empathy prompt + escalation flag
  -> Response validator (safety, accuracy)
  -> Cache layer (semantic similarity check)
  -> Response
```

Each component is independently testable, scalable, and replaceable. When billing docs change, you update one retrieval index, not the entire system. When a cheaper model emerges, you swap it into the FAQ handler without touching the complaint handler.
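A minimal routing skeleton for this decomposition might look like the sketch below. The intent labels, handler functions, and the keyword-based `classify_intent` stub are illustrative assumptions; in production the classifier would be a fast, cheap model and the handlers would wrap real retrieval and model calls.

```python
# Sketch of intent classification + routing. All handlers are stubs.
from typing import Callable

FAQ_CACHE = {"what are your hours?": "We're available 24/7 via chat."}

def classify_intent(text: str) -> str:
    # Stand-in for a fast, cheap classifier model.
    lowered = text.lower()
    if lowered in FAQ_CACHE:
        return "faq"
    if any(w in lowered for w in ("invoice", "charge", "refund")):
        return "billing"
    if any(w in lowered for w in ("error", "crash", "bug")):
        return "technical"
    return "complaint"

def handle_faq(text: str) -> str:
    return FAQ_CACHE[text.lower()]  # no model call at all

def handle_billing(text: str) -> str:
    return f"[RAG over billing docs + large model] {text}"

def handle_technical(text: str) -> str:
    return f"[RAG over tech docs + large model] {text}"

def handle_complaint(text: str) -> str:
    return f"[empathy prompt + escalation flag set] {text}"

HANDLERS: dict[str, Callable[[str], str]] = {
    "faq": handle_faq,
    "billing": handle_billing,
    "technical": handle_technical,
    "complaint": handle_complaint,
}

def route(text: str) -> str:
    intent = classify_intent(text)
    response = HANDLERS[intent](text)
    # A response validator and semantic cache would sit here before returning.
    return response

if __name__ == "__main__":
    print(route("I was charged twice on my invoice"))
    print(route("what are your hours?"))
```

Notice that swapping the FAQ handler for a cheaper model, or rebuilding the billing retrieval index, touches exactly one function.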
Design heuristic: If you can't explain what a component does in one sentence, it's doing too much. Split it.
Failure Modes and Feedback Loops
Systems thinking forces you to answer: "What happens when this breaks?" For every component, you need:

- **Failure detection**: How do you know it broke? (Latency thresholds, error rates, output quality scores)
- **Failure isolation**: Does one broken component cascade? (Circuit breakers, bulkheads, timeouts)
- **Failure recovery**: What happens next? (Retries, fallback models, graceful degradation, cached responses)

The feedback loop is what separates a static deployment from a living system. Every request generates signal: Was the response used? Did the user retry? Did they escalate to a human? This signal feeds back into prompt tuning, retrieval optimization, and model selection.
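The sketch below combines the three concerns from the list above (detection, isolation, recovery) around a single model call. `call_primary_model`, `call_fallback_model`, and the cached default are hypothetical stand-ins; the pattern of bounded retries with backoff, then a fallback model, then a cached response, is what matters.

```python
# Sketch: bounded retries, fallback model, cached last-resort response.
import time

RESPONSE_CACHE = {"default": "We're looking into this and will follow up shortly."}

class ModelUnavailable(Exception):
    pass

def call_primary_model(prompt: str) -> str:
    # Stand-in for the real call; always raises here to simulate an outage.
    raise ModelUnavailable("primary model timed out")

def call_fallback_model(prompt: str) -> str:
    return f"[smaller fallback model] {prompt[:40]}"

def answer(prompt: str, max_retries: int = 2) -> str:
    # Detection: every failure is counted and logged, feeding error-rate metrics.
    for attempt in range(1, max_retries + 1):
        try:
            return call_primary_model(prompt)
        except ModelUnavailable as exc:
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(0.1 * attempt)  # backoff keeps retries from piling on
    # Isolation + recovery: stop hammering the primary and degrade gracefully.
    try:
        return call_fallback_model(prompt)
    except Exception:
        return RESPONSE_CACHE["default"]  # last-resort cached response

if __name__ == "__main__":
    print(answer("Summarize this support ticket for the on-call engineer."))
```

A fuller implementation would also open a circuit breaker after repeated failures so that later requests skip the primary model entirely until it recovers.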
Real-world pattern: Anthropic's own API uses a tiered system -- requests are classified by complexity, routed to appropriate model configurations, and failures trigger automatic fallback to cached or simplified responses. The user rarely notices because the system degrades gracefully rather than failing hard.