📚Academy
likeone
online

Monitoring & Self-Healing

Systems that fix themselves before you even know something broke.

The sovereign stack does not page you at 3am. It detects problems, diagnoses root causes, applies fixes, and verifies recovery -- all autonomously. You wake up to a status report, not a crisis.

What you'll learn

  • Health check patterns: what to monitor and how
  • Auto-restart and recovery: bringing services back automatically
  • Proactive alerts: catching problems before they become crises
  • Graceful degradation: staying functional when components fail

The Monitoring Stack

Effective monitoring has three layers:

Health checks (is it alive?). Periodic pings to every service. Is Ollama responding? Is the brain database accessible? Is the website up? Binary alive/dead status, checked every 30-60 seconds.

Metrics (how is it performing?). Response times, CPU usage, RAM consumption, queue depths, error rates. Not just alive, but healthy. A service with 99% CPU usage is alive but about to crash.

Logs (what happened?). Structured logs from every service. When something goes wrong, logs tell you why. When the self-healer fixes something, logs prove it worked. Essential for debugging and audit trails.

Health Check System

// Health check runner -- checks all services every 60 seconds const services = [ { name: 'ollama', check: async () => { const res = await fetch('http://localhost:11434/api/tags'); return res.ok; }, restart: 'ollama serve', critical: true }, { name: 'brain', check: async () => { const result = brain.read('system.health_check'); return result !== null; }, restart: null, // Cannot restart a file -- alert instead critical: true }, { name: 'website', check: async () => { const res = await fetch('https://yourdomain.com', { timeout: 10000 }); return res.ok; }, restart: null, // Hosted externally -- alert critical: false } ]; async function runHealthChecks() { const results = []; for (const service of services) { try { const healthy = await service.check(); results.push({ name: service.name, healthy, timestamp: new Date() }); if (!healthy && service.restart) { console.log(`${service.name} unhealthy. Attempting restart...`); await exec(service.restart); // Wait 5 seconds, then re-check await wait(5000); const recovered = await service.check(); results[results.length - 1].recovered = recovered; } } catch (e) { results.push({ name: service.name, healthy: false, error: e.message }); } } // Write health status to brain brain.write('system.health_status', JSON.stringify(results), 'system'); return results; }

Self-Healing Patterns

Self-healing turns monitoring from passive observation into active repair:

Auto-restart. The simplest self-healing: if a service is down, restart it. Ollama crashed? Run ollama serve. Cron job stopped? Restart the scheduler. Most transient failures are fixed by a restart.

Exponential backoff restart. If restarting fails, do not hammer the service. Wait 10 seconds, then 30, then 60, then 5 minutes. After 5 failed restarts, stop trying and alert a human. The service has a deeper problem that automation cannot fix.

Resource-triggered healing. RAM above 90%? The self-healer kills the least-important process. CPU at 100% for 5 minutes? Pause non-critical background tasks. Disk above 95%? Clean old logs and temporary files. Prevention beats cure.

Service substitution. The primary Ollama instance is down? Route requests to a backup model. The primary database is unreachable? Switch to a read-only replica. Cloud API is rate-limited? Fall back to local model. The system degrades instead of breaking.

🔒

This lesson is for Pro members

Unlock all 355+ lessons across 36 courses with Academy Pro. Founding members get 90% off — forever.

Already a member? Sign in to access your lessons.

Academy
Built with soul — likeone.ai