Monitoring & Healing
An autonomous system is not complete until it can watch itself and fix its own problems. This lesson teaches you to build health checks, auto-healers, and escalation pipelines — the immune system of your agent fleet.
Why Monitoring Is Non-Negotiable
Without monitoring, your agents run blind. A health check script fails silently for weeks. A heartbeat stops reaching the database — but the logs say everything is fine (because the write was rejected, not errored). A cron job dies and nobody notices until a customer asks why their report is two weeks late.
These are not hypothetical failures. They happen in production every day. The solution is three layers of defense:
1. Health checks — Periodic pings that verify each agent is alive and responding correctly. Not just "is the process running?" but "is it producing correct output?" A health check that only checks uptime will miss a silently broken agent.
2. Auto-healing — When a health check fails, an auto-healer agent takes action: restart the process, roll back to a previous version, or clear a stuck queue. This happens automatically, without human intervention, for known failure modes.
3. Escalation — When auto-healing fails (max retries exhausted, or the problem requires human judgment), the system escalates: Slack alert, email, PagerDuty. Humans should only be paged for problems the system cannot solve itself.
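The third layer can be sketched as a small shell function. This is a minimal sketch, not the lesson's actual implementation; the `SLACK_WEBHOOK` variable and the `escalate` function name are assumptions for illustration:

```shell
#!/bin/bash
# escalate.sh - minimal escalation sketch (SLACK_WEBHOOK is an assumed env var)
escalate() {
  local agent=$1 reason=$2
  local msg="ALERT: auto-heal failed for ${agent}: ${reason}"
  # Post to Slack if a webhook is configured; otherwise fall back to stderr
  if [ -n "$SLACK_WEBHOOK" ]; then
    curl -s -X POST -H 'Content-Type: application/json' \
      -d "{\"text\": \"${msg}\"}" "$SLACK_WEBHOOK" > /dev/null
  else
    echo "$msg" >&2
  fi
}

escalate "brain-agent" "max retries exhausted"
```

The key property: escalation is the last resort, reached only after the auto-healer has already tried and failed, so every page a human receives is actionable.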
What a Health Check Looks Like
Here is a real health check script from Like One's GCP Watcher. It runs every 15 minutes via a systemd timer and checks four endpoints:
```shell
#!/bin/bash
# health-check.sh — runs every 15 min on GCP

check() {
  local name=$1; shift
  local start=$(date +%s%N)
  local code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$@")
  local ms=$(( ($(date +%s%N) - start) / 1000000 ))
  if [ "$code" -ge 200 ] && [ "$code" -lt 400 ]; then
    echo "$name:ok:${ms}ms"
  else
    echo "$name:FAIL:${code}"   # ← this triggers alerts
  fi
}

check "site"    https://likeone.ai/
check "brain"   -H "apikey: $KEY" "$URL/rest/v1/brain_context?limit=1"
check "edge"    "$URL/functions/v1/founding-count"
check "academy" https://likeone.ai/academy/
```
Notice it checks HTTP status codes AND measures response time. A 200 that takes 30 seconds is still a problem — latency matters.
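To make latency a first-class failure rather than just a logged number, the pass/fail verdict can take the measured time into account. A sketch of that logic, factored out so it is easy to test; the 5-second threshold and the function name are assumptions, not part of the original script:

```shell
#!/bin/bash
# check_with_latency - fail a check when the endpoint is slow, even on HTTP 200
MAX_MS=5000   # assumed threshold; tune per endpoint

check_with_latency() {
  local name=$1 code=$2 ms=$3
  # Pass only if the status code is healthy AND the response was fast enough
  if [ "$code" -ge 200 ] && [ "$code" -lt 400 ] && [ "$ms" -le "$MAX_MS" ]; then
    echo "$name:ok:${ms}ms"
  else
    echo "$name:FAIL:code=${code}:${ms}ms"
  fi
}

check_with_latency "site" 200 120     # fast 200 → ok
check_with_latency "site" 200 30000   # slow 200 → FAIL
```

Separating the verdict from the curl call also means the threshold logic can be unit-tested without a network.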
Restart vs. Rollback
These are the two primary healing actions, and choosing the wrong one makes things worse:
Restart: Clears a crashed or hung process and resumes from the current code. Use for: connection timeouts, memory leaks, stuck queues, process crashes. Does NOT fix bad code.
Rollback: Reverts to a previous working code version. Use for: bad deploys, broken config changes, regressions. Does NOT fix infrastructure issues like network outages.
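The choice between the two can be encoded as a tiny dispatcher keyed on the failure class. This is a sketch under assumptions: the failure-class names, the systemd unit, and the `rollback.sh` script are all hypothetical:

```shell
#!/bin/bash
# heal.sh - choose a healing action from a failure class (all names illustrative)
heal_action() {
  case "$1" in
    timeout|oom|stuck_queue|crash)    echo "restart"  ;;  # runtime/infra faults
    bad_deploy|bad_config|regression) echo "rollback" ;;  # code/config faults
    *)                                echo "escalate" ;;  # unknown → human judgment
  esac
}

unit="agent-brain.service"   # hypothetical systemd unit
case "$(heal_action "$1")" in
  restart)  echo "would run: systemctl restart $unit" ;;
  rollback) echo "would run: ./rollback.sh $unit" ;;   # hypothetical script
  escalate) echo "would page a human for $unit" ;;
esac
```

Note the default branch: anything the dispatcher does not recognize escalates rather than guessing, which is what keeps a wrong healing action from making things worse.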
Auto-Healer Configuration
An auto-healer is a supervisor agent that monitors other agents and automatically fixes problems. A typical configuration specifies:
- which agents to watch (all, critical only, or a specific pipeline)
- how often to check (every 30 seconds to every 5 minutes)
- the default action on error (restart, rollback, or escalate)
- a maximum retry count to prevent restart loops
- an escalation channel (Slack, email, or both) for when automatic fixes fail
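As one concrete (and entirely hypothetical) shape, that configuration could live in a plain env file the supervisor sources at startup; every variable name here is illustrative, not a real product's schema:

```shell
# auto-healer.env - hypothetical auto-healer configuration
WATCH_SCOPE="critical"        # all | critical | pipeline:<name>
CHECK_INTERVAL_SEC=60         # typical range: 30-300 seconds
DEFAULT_ACTION="restart"      # restart | rollback | escalate
MAX_RETRIES=3                 # stop restart loops after this many attempts
ESCALATION_CHANNEL="slack"    # slack | email | both
```

An env file keeps the policy out of the healer's code, so tightening the check interval or switching the escalation channel is a config change, not a deploy.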