Observability & Monitoring.

You can't fix what you can't see. Traces, metrics, and logs for AI systems.

After this lesson you'll know

  • The three pillars of observability applied to AI: traces, metrics, logs
  • How to instrument model calls for cost, latency, and quality tracking
  • Building evaluation pipelines that catch regressions before users do
  • Tools of the trade: LangSmith, Helicone, Braintrust, and custom solutions

Why AI Observability is Different

Traditional web observability asks: "Did the server respond with 200 OK in under 500ms?" AI observability asks all of that, plus: "Was the response actually good?" A model call can succeed (200 OK, fast latency, valid JSON) and still produce a hallucinated, off-topic, or unsafe response. This is the fundamental challenge: correctness is not binary, latency is unpredictable, and costs vary by orders of magnitude between requests.

You need three layers of visibility:

1. **Operational metrics**: Is the system running? (Uptime, error rates, latency)
2. **Business metrics**: Is it working? (User satisfaction, task completion, revenue impact)
3. **Model metrics**: Is the AI good? (Quality scores, hallucination rates, token efficiency)
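As a concrete sketch, one way to capture all three layers per request is a single record like the one below. The field names and the `record_request` helper are illustrative assumptions, not part of any specific library:

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    # Operational: is the system running?
    latency_ms: float
    status: str                   # "ok", "error", "timeout"
    # Business: is it working?
    task_completed: bool
    user_feedback: int | None     # e.g. thumbs up/down mapped to +1 / -1
    # Model: is the AI good?
    quality_score: float | None   # from an automated evaluator, 0.0-1.0
    input_tokens: int
    output_tokens: int
    cost_usd: float

def record_request(metrics: RequestMetrics) -> None:
    # Ship to whatever backend you already use (Prometheus, StatsD, a warehouse table).
    print(metrics)
```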
The silent failure problem: In traditional systems, failures are loud -- 500 errors, timeouts, crashes. In AI systems, the most dangerous failures are silent. The model confidently returns wrong information, and nothing in your monitoring detects it unless you've built quality evaluation into your pipeline.
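One way to make silent failures loud is an LLM-as-judge check on a sample of live traffic. The grading prompt, the 1-5 scale, and the `judge_model.generate(...).text` call below are assumptions for illustration, mirroring the generic model API used later in this lesson:

```python
GRADER_PROMPT = """Rate the answer for factual accuracy and relevance to the question,
on a scale of 1 (wrong or off-topic) to 5 (fully correct). Reply with only the number.

Question: {question}
Answer: {answer}"""

async def evaluate_response(question: str, answer: str, judge_model) -> float:
    """Grade a response with a second model so quality shows up as a metric."""
    grade = await judge_model.generate(
        GRADER_PROMPT.format(question=question, answer=answer)
    )
    return float(grade.text.strip())

async def check_quality(question: str, answer: str, judge_model, alert) -> None:
    score = await evaluate_response(question, answer, judge_model)
    if score < 3:
        # A low score is the loud signal that a silent failure just happened.
        alert(f"Low-quality response (score={score}): {question[:80]}")
```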

Tracing AI Pipelines

A trace follows a single request through every stage of your system. For AI, this means capturing not just timing but the actual prompts, completions, and intermediate state at each step.

```python
from dataclasses import dataclass, field
import time
import uuid


@dataclass
class AITrace:
    """Collects timing and metadata for every span in a single request."""
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    spans: list = field(default_factory=list)

    def span(self, name):
        return TraceSpan(self, name)


class TraceSpan:
    """Async context manager that times one stage and records its metadata."""

    def __init__(self, trace, name):
        self.trace = trace
        self.name = name
        self.metadata = {}

    async def __aenter__(self):
        self.start = time.time()
        return self

    async def __aexit__(self, *args):
        self.duration = time.time() - self.start
        self.trace.spans.append({
            "name": self.name,
            "duration_ms": self.duration * 1000,
            **self.metadata,
        })


# Usage in a RAG pipeline
trace = AITrace()

async with trace.span("retrieval") as span:
    docs = await retriever.search(query, top_k=5)
    span.metadata = {"doc_count": len(docs), "top_score": docs[0].score}

async with trace.span("generation") as span:
    response = await model.generate(prompt, context=docs)
    span.metadata = {
        "model": "claude-sonnet-4-20250514",
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "cost_usd": calculate_cost(response.usage),
    }
```

Every trace should capture: the model used, token counts, latency per stage, cost per call, and enough of the prompt/response to debug issues without logging sensitive user data.
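The `calculate_cost` helper above is left undefined. A minimal sketch, assuming a price table in USD per million tokens; the numbers below are placeholders, so look up your provider's current price sheet:

```python
# Placeholder prices (USD per million tokens); replace with current provider pricing.
PRICES = {
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}

def calculate_cost(usage, model: str = "claude-sonnet-4-20250514") -> float:
    price = PRICES[model]
    return (
        usage.input_tokens / 1_000_000 * price["input"]
        + usage.output_tokens / 1_000_000 * price["output"]
    )
```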
Privacy guardrail: Never log raw user inputs to your observability system without PII scrubbing. Log prompt templates and metadata, not the actual user content. If you must log content for debugging, use a separate, access-controlled, auto-expiring store.
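A minimal sketch of that guardrail, assuming a regex-based scrubber; real deployments typically use a dedicated PII-detection library and log only the template name plus metadata:

```python
import re

# Assumed patterns for illustration; real PII detection needs a proper library.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "<PHONE>"),
]

def scrub(text: str) -> str:
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

def log_span(span_name: str, template_name: str, user_input: str, logger) -> None:
    # Log the template identifier and metadata, never raw user content.
    logger.info({
        "span": span_name,
        "prompt_template": template_name,
        "input_chars": len(user_input),
        "input_preview": scrub(user_input)[:100],  # only if debugging truly requires it
    })
```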