Monitoring & Healing
An autonomous system is not complete until it can watch itself and fix its own problems. This lesson teaches you to build health checks, auto-healers, and escalation pipelines — the immune system of your agent fleet.
Why Monitoring Is Non-Negotiable
Without monitoring, your agents run blind. A health check script fails silently for weeks. A heartbeat stops reaching the database — but the logs say everything is fine (because the write was rejected, not errored). A cron job dies and nobody notices until a customer asks why their report is two weeks late.
These are not hypothetical failures. They happen in production every day. The solution is three layers of defense:
1. Health checks — Periodic pings that verify each agent is alive and responding correctly. Not just "is the process running?" but "is it producing correct output?" A health check that only checks uptime will miss a silently broken agent.
2. Auto-healing — When a health check fails, an auto-healer agent takes action: restart the process, roll back to a previous version, or clear a stuck queue. This happens automatically, without human intervention, for known failure modes.
3. Escalation — When auto-healing fails (max retries exhausted, or the problem requires human judgment), the system escalates: Slack alert, email, PagerDuty. Humans should only be paged for problems the system cannot solve itself.
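The third layer can be sketched as a small shell function. This is a minimal sketch, not the lesson's actual implementation; the `SLACK_WEBHOOK` variable and the `escalate` function name are assumptions for illustration:

```shell
#!/bin/bash
# escalate.sh - minimal escalation sketch (SLACK_WEBHOOK is an assumed env var)
escalate() {
  local agent=$1 reason=$2
  local msg="ALERT: auto-heal failed for ${agent}: ${reason}"
  # Post to Slack if a webhook is configured; otherwise fall back to stderr
  if [ -n "$SLACK_WEBHOOK" ]; then
    curl -s -X POST -H 'Content-Type: application/json' \
      -d "{\"text\": \"${msg}\"}" "$SLACK_WEBHOOK" > /dev/null
  else
    echo "$msg" >&2
  fi
}

escalate "brain-agent" "max retries exhausted"
```

The key property: escalation is the last resort, reached only after the auto-healer has already tried and failed, so every page a human receives is actionable.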
What a Health Check Looks Like
Here is a real health check script from Like One's GCP Watcher. It runs every 15 minutes via a systemd timer and checks four endpoints:
```shell
#!/bin/bash
# health-check.sh — runs every 15 min on GCP

check() {
  local name=$1; shift
  local start=$(date +%s%N)
  local code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$@")
  local ms=$(( ($(date +%s%N) - start) / 1000000 ))
  if [ "$code" -ge 200 ] && [ "$code" -lt 400 ]; then
    echo "$name:ok:${ms}ms"
  else
    echo "$name:FAIL:${code}"   # ← this triggers alerts
  fi
}

check "site"    https://likeone.ai/
check "brain"   -H "apikey: $KEY" "$URL/rest/v1/brain_context?limit=1"
check "edge"    "$URL/functions/v1/founding-count"
check "academy" https://likeone.ai/academy/
```
Notice it checks HTTP status codes AND measures response time. A 200 that takes 30 seconds is still a problem — latency matters.
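To make latency a first-class failure rather than just a logged number, the pass/fail verdict can take the measured time into account. A sketch of that logic, factored out so it is easy to test; the 5-second threshold and the function name are assumptions, not part of the original script:

```shell
#!/bin/bash
# check_with_latency - fail a check when the endpoint is slow, even on HTTP 200
MAX_MS=5000   # assumed threshold; tune per endpoint

check_with_latency() {
  local name=$1 code=$2 ms=$3
  # Pass only if the status code is healthy AND the response was fast enough
  if [ "$code" -ge 200 ] && [ "$code" -lt 400 ] && [ "$ms" -le "$MAX_MS" ]; then
    echo "$name:ok:${ms}ms"
  else
    echo "$name:FAIL:code=${code}:${ms}ms"
  fi
}

check_with_latency "site" 200 120     # fast 200 → ok
check_with_latency "site" 200 30000   # slow 200 → FAIL
```

Separating the verdict from the curl call also means the threshold logic can be unit-tested without a network.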
Restart vs. Rollback
These are the two primary healing actions, and choosing the wrong one makes things worse:
Restart: Clears a crashed or hung process and resumes from the current code. Use for: connection timeouts, memory leaks, stuck queues, process crashes. Does NOT fix bad code.
Rollback: Reverts to a previous working code version. Use for: bad deploys, broken config changes, regressions. Does NOT fix infrastructure issues like network outages.
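The choice between the two can be encoded as a tiny dispatcher keyed on the failure class. This is a sketch under assumptions: the failure-class names, the systemd unit, and the `rollback.sh` script are all hypothetical:

```shell
#!/bin/bash
# heal.sh - choose a healing action from a failure class (all names illustrative)
heal_action() {
  case "$1" in
    timeout|oom|stuck_queue|crash)    echo "restart"  ;;  # runtime/infra faults
    bad_deploy|bad_config|regression) echo "rollback" ;;  # code/config faults
    *)                                echo "escalate" ;;  # unknown → human judgment
  esac
}

unit="agent-brain.service"   # hypothetical systemd unit
case "$(heal_action "$1")" in
  restart)  echo "would run: systemctl restart $unit" ;;
  rollback) echo "would run: ./rollback.sh $unit" ;;   # hypothetical script
  escalate) echo "would page a human for $unit" ;;
esac
```

Note the default branch: anything the dispatcher does not recognize escalates rather than guessing, which is what keeps a wrong healing action from making things worse.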
Auto-Healer Configuration
An auto-healer is a supervisor agent that monitors other agents and automatically fixes problems. A typical configuration specifies:
- which agents to watch (all, critical only, or a specific pipeline)
- how often to check (every 30 seconds to every 5 minutes)
- the default action on error (restart, rollback, or escalate)
- a maximum retry count to prevent restart loops
- an escalation channel (Slack, email, or both) for when automatic fixes fail
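As one concrete (and entirely hypothetical) shape, that configuration could live in a plain env file the supervisor sources at startup; every variable name here is illustrative, not a real product's schema:

```shell
# auto-healer.env - hypothetical auto-healer configuration
WATCH_SCOPE="critical"        # all | critical | pipeline:<name>
CHECK_INTERVAL_SEC=60         # typical range: 30-300 seconds
DEFAULT_ACTION="restart"      # restart | rollback | escalate
MAX_RETRIES=3                 # stop restart loops after this many attempts
ESCALATION_CHANNEL="slack"    # slack | email | both
```

An env file keeps the policy out of the healer's code, so tightening the check interval or switching the escalation channel is a config change, not a deploy.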