
Openclaw High Availability: Redundancy and Failover Configuration


A single Openclaw instance on a VPS handles most workloads fine, right up until it doesn’t. The gateway process crashes at 2 AM, the Docker container runs out of memory during a scheduled cron burst, or your hosting provider reboots the VM for maintenance. If your business relies on Openclaw to route customer messages, trigger workflows, or run time-sensitive automations, that unplanned downtime costs you more than the infrastructure to prevent it.

The pattern that works for near-continuous uptime is not complicated: automatic process restarts catch the easy failures, multiple instances behind a load balancer handle the rest, and monitoring alerts tell you when something needs human attention before users notice. This guide walks through each layer with working configuration files.

Automatic Restart: The First Layer of Resilience

Before running multiple instances, make sure a single instance recovers from crashes on its own. This is the highest-impact, lowest-effort HA measure, and it catches roughly 80% of real-world outages: OOM kills, unhandled exceptions, and transient network errors that crash the gateway process.

Systemd Watchdog Configuration

If you run Openclaw directly on a Linux host (not in Docker), a systemd unit file with restart policies and a watchdog timer is the most reliable approach. Most guides suggest a basic Restart=always, but that misses the watchdog, which catches hung processes that are still technically alive but not responding.

[Unit]
Description=Openclaw Gateway
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Type=simple
User=openclaw
WorkingDirectory=/opt/openclaw
ExecStart=/usr/bin/node gateway/index.js
Restart=on-failure
RestartSec=5
# WatchdogSec only works if the process sends sd_notify WATCHDOG=1
# heartbeats; remove this line if your gateway build does not support it.
WatchdogSec=60

Environment=NODE_ENV=production
Environment=OPENCLAW_PORT=18789
EnvironmentFile=/opt/openclaw/.env

[Install]
WantedBy=multi-user.target

The key settings here:

  • RestartSec=5 adds a 5-second delay between restarts so you do not hammer a failing dependency in a tight loop.
  • WatchdogSec=60 kills and restarts the process if it stops sending heartbeats for 60 seconds, catching the silent-hang failure mode that Restart=on-failure alone misses. Note that systemd does not generate the heartbeats itself: the service must send sd_notify WATCHDOG=1 pings, or the watchdog will kill a healthy-but-silent process every 60 seconds.
  • StartLimitBurst=5 with a 300-second interval prevents infinite restart loops. If the gateway crashes 5 times in 5 minutes, systemd stops trying and you get an alert instead.
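If the gateway does not natively speak the sd_notify protocol, a small wrapper can send the heartbeats on its behalf. This is a sketch, not something Openclaw ships: the script path and ping interval are illustrative, and it requires switching the unit to Type=notify with NotifyAccess=all so systemd accepts pings from the wrapper.

```shell
#!/bin/bash
# /opt/openclaw/watchdog-wrapper.sh (hypothetical path) -- run a command and
# emit systemd watchdog heartbeats while it is alive. Outside systemd the
# notify calls are harmless no-ops.
notify() { systemd-notify "$@" 2>/dev/null || true; }

run_with_watchdog() {
    local interval="$1"; shift
    "$@" &                           # launch the real process
    local child=$!
    notify --ready                   # tell systemd startup is complete
    while kill -0 "$child" 2>/dev/null; do
        notify WATCHDOG=1            # heartbeat, well inside WatchdogSec=60
        sleep "$interval"
    done
    wait "$child"                    # propagate the child's exit status
}

# Unit file usage:
#   ExecStart=/opt/openclaw/watchdog-wrapper.sh 20 /usr/bin/node gateway/index.js
# Demo with a short-lived stand-in process:
run_with_watchdog 1 sleep 2 && echo "child exited cleanly"
```

The wrapper pings at a third of the watchdog window, so one missed ping does not trigger a kill.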

Enable it:

sudo systemctl daemon-reload
sudo systemctl enable --now openclaw-gateway

Docker Restart Policies

If you run Openclaw in Docker (which our Docker deployment guide covers in detail), the equivalent is a restart policy combined with a health check in your docker-compose.yml:

services:
  openclaw:
    image: openclaw/openclaw:latest
    restart: unless-stopped
    ports:
      - "18789:18789"
    env_file: .env
    volumes:
      - openclaw_data:/app/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:18789/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    deploy:
      resources:
        limits:
          memory: 2G

The restart: unless-stopped policy restarts the container after crashes and after host reboots, but respects manual docker stop commands. The health check is what makes this production-grade: Docker marks the container as unhealthy after 3 failed checks, and orchestrators like Docker Swarm or Kubernetes use that signal to replace the container automatically.

One important detail: set start_period to at least 30-40 seconds. Openclaw’s gateway takes time to initialize channel connections and load memory backends. Without a start period, Docker marks a healthy but still-booting container as unhealthy and restarts it in a loop.

Running Multiple Openclaw Instances

Automatic restarts handle process failures, but they cannot protect against host-level failures (VM crash, network partition, kernel panic) or eliminate downtime during restarts. For that, you need multiple instances.

How Many Instances

Three is the practical minimum for high availability. With three instances:

  • One can be down for maintenance or failure while two continue serving traffic.
  • You avoid the split-brain problem that plagues two-node setups.
  • Rolling updates replace one instance at a time without dropping below two active nodes.

Two instances are better than one, but during a rolling update or failure event you are down to a single instance with zero redundancy. For production workloads, run three.

The Cron Leader Problem

Openclaw runs scheduled jobs (cron tasks) inside the gateway process. If you run three identical instances, you get three copies of every cron job firing simultaneously. This causes duplicate messages, repeated API calls, and race conditions in state files.

The fix is a cron leader election. Designate one instance as the cron leader and disable cron on the others:

# Instance 1 (leader): cron enabled
OPENCLAW_SKIP_CRON=0

# Instance 2 and 3: cron disabled
OPENCLAW_SKIP_CRON=1

If you are on Kubernetes, implement this with a ConfigMap per replica or use a Redis-based distributed lock. Our Openclaw cron jobs guide covers the scheduling mechanics in depth.

The downside of static leader election: if the leader dies, cron jobs stop until it recovers or you manually promote another instance. For most teams, this is acceptable because cron jobs are typically tolerant of short delays. If you need automatic leader failover for cron, you will need a distributed lock service like Redis or etcd.
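The lock pattern behind automatic failover is the same regardless of backend: each instance tries to acquire a short-lived lease before a cron cycle, and only the holder runs the jobs. A minimal sketch of the idea using an atomic mkdir (everything here is illustrative; in practice the lock would live on shared storage or in Redis):

```shell
#!/bin/bash
# Sketch of lease-style leader election. mkdir is atomic, so at most one
# instance per cycle wins the lock. Paths and messages are illustrative.
LOCKDIR="${TMPDIR:-/tmp}/openclaw-cron-leader"

if mkdir "$LOCKDIR" 2>/dev/null; then
    echo "leader: running scheduled jobs"
    # ... execute this cycle's cron jobs ...
    rmdir "$LOCKDIR"                 # release the lease
else
    echo "follower: another instance holds the lock, skipping"
fi
```

With Redis, `SET openclaw:cron-leader <instance-id> NX EX 30` gives the same semantics plus automatic expiry if the leader dies mid-lease, which the mkdir version lacks.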

Load Balancing Across Instances

With multiple instances running, you need a load balancer to distribute incoming traffic and route around failed instances. Nginx is the most common choice, and the configuration takes about 10 minutes.

Nginx Upstream Configuration

upstream openclaw_backend {
    server 10.0.1.10:18789 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:18789 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:18789 max_fails=3 fail_timeout=30s;
}

server {
    listen 443 ssl;
    server_name openclaw.yourdomain.com;

    ssl_certificate     /etc/ssl/certs/openclaw.pem;
    ssl_certificate_key /etc/ssl/private/openclaw.key;

    location / {
        proxy_pass http://openclaw_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        proxy_next_upstream error timeout http_502 http_503;
        proxy_next_upstream_timeout 10s;
        proxy_next_upstream_tries 2;
    }

    location /health {
        proxy_pass http://openclaw_backend;
        access_log off;
    }
}

The critical settings:

  • max_fails=3 fail_timeout=30s marks an upstream server unavailable for 30 seconds once it accumulates 3 failed attempts within a 30-second window, then tries it again. This is Nginx's passive health checking.
  • proxy_next_upstream automatically retries the request on a different instance if the first one returns a 502, 503, or times out. The user never sees the error.
  • proxy_next_upstream_tries 2 caps a request at 2 attempts so a cascading failure does not amplify load.

If you use Caddy instead of Nginx, the configuration is simpler, and health_uri adds active health checking (Caddy probes each backend on an interval) on top of the passive tracking that fail_duration provides:

openclaw.yourdomain.com {
    reverse_proxy 10.0.1.10:18789 10.0.1.11:18789 10.0.1.12:18789 {
        health_uri /health
        health_interval 15s
        fail_duration 30s
    }
}

Webhook and Channel Routing

For messaging channels (Telegram, Discord, WhatsApp), webhooks arrive at a single URL. The load balancer distributes them across instances, which means any instance might handle any conversation. This works if all instances share the same state backend (see next section). If they do not, webhook messages can arrive at an instance that lacks the conversation context, producing confused or repeated responses.

For WhatsApp specifically, session credentials are tied to a single process. You need a shared persistent volume for the WhatsApp session files, or use the official WhatsApp Business API which is stateless. Our WhatsApp integration guide covers this distinction.

State Synchronization

The Openclaw gateway is not stateless. It maintains conversation memory, cron job state in jobs.json, channel credentials, and skill configurations on disk. Running multiple instances that each write to isolated local storage causes state divergence. Instance A processes a message and updates memory. Instance B handles the next message in the same conversation but has stale memory. The response makes no sense.

Shared Storage Options

The simplest approach: mount a shared filesystem across all instances.

| Option | Best For | Watch Out For |
|---|---|---|
| NFS | Small clusters (2-4 nodes) | File locking is advisory, not enforced. Concurrent writes to jobs.json can corrupt the file. |
| CephFS | Larger clusters with strong consistency needs | Operational complexity; needs 3 monitor nodes minimum |
| Cloud-managed (EFS, Filestore) | Cloud deployments | Latency adds 1-5 ms per file operation; can slow gateway startup |

For most teams running 3 instances on a single cloud provider, NFS or the managed equivalent (AWS EFS, GCP Filestore) works. The main gotcha is concurrent file writes. Openclaw’s gateway writes to jobs.json, memory files, and credential stores. Two instances writing the same file simultaneously can produce corrupted JSON.

This can break in production. The fix is simple but not obvious: use the OPENCLAW_SKIP_CRON=1 flag on non-leader instances (which eliminates most concurrent writes to jobs.json) and configure a dedicated memory backend like Mem0 or QMD instead of file-based memory. Database-backed memory handles concurrent access natively.
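Where file writes cannot be avoided entirely, an advisory lock at least serializes writers on filesystems that honor it (local disk and CephFS do; NFS locking is less dependable, as the table above notes). A sketch using util-linux flock; the wrapper function and paths are illustrative, not something Openclaw ships:

```shell
#!/bin/bash
# Sketch: serialize writes to a shared state file with an advisory lock.
# Paths are stand-ins; in practice the target would be jobs.json on the
# shared volume.
STATE="${TMPDIR:-/tmp}/openclaw-jobs.json"
LOCK="${STATE}.lock"

write_state() {
    # Hold the lock only while tee rewrites the file; -w 5 gives up after
    # 5 seconds instead of waiting forever behind a stuck writer.
    printf '%s\n' "$1" | flock -w 5 "$LOCK" tee "$STATE" >/dev/null
}

write_state '{"jobs": []}'
cat "$STATE"
```

The same wrapper pattern works for any shared file the gateway touches, but it only protects writers that go through the lock, which is why a database-backed memory store is the more robust answer.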

For more on memory configuration, see our Openclaw memory configuration guide.

Health Checks and Monitoring

Running multiple instances behind a load balancer only helps if you know when something fails. Health checks serve two audiences: the load balancer (for automatic traffic rerouting) and your operations team (for alerting).

Application Health Endpoint

Openclaw exposes a /health endpoint on the gateway port. A basic check verifies the HTTP response:

curl -sf http://localhost:18789/health || echo "UNHEALTHY"

For a deeper check suitable for cron or an external monitor, script the probe with an explicit status-code test and a timeout:

#!/bin/bash
# openclaw-health-deep.sh
HEALTH=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://localhost:18789/health)
if [ "$HEALTH" != "200" ]; then
    echo "CRITICAL: Gateway health check failed (HTTP $HEALTH)"
    exit 1
fi
echo "OK: Gateway healthy"
exit 0

Monitoring with Prometheus

If you run Prometheus (and you should if you are serious about HA), add a blackbox exporter probe for each Openclaw instance:

# prometheus.yml
scrape_configs:
  - job_name: 'openclaw-health'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://10.0.1.10:18789/health
          - http://10.0.1.11:18789/health
          - http://10.0.1.12:18789/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Pair this with an alert rule that fires when any instance is down for more than 2 minutes:

# alert.rules.yml
groups:
  - name: openclaw
    rules:
      - alert: OpenclawInstanceDown
        expr: probe_success{job="openclaw-health"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Openclaw instance {{ $labels.instance }} is down"

Route the alert to Slack, PagerDuty, or email via Alertmanager. The goal is to know about failures within 2-3 minutes, well before your users report them.

For teams that do not run Prometheus, a simple cron-based check works as a starting point:

# /etc/cron.d/openclaw-monitor -- cron does not support line continuations,
# so the job must stay on one line; note $(hostname) needs double quotes to expand
*/2 * * * * root curl -sf --max-time 10 http://localhost:18789/health >/dev/null || curl -s -X POST -H 'Content-type: application/json' --data "{\"text\":\"Openclaw health check failed on $(hostname)\"}" https://hooks.slack.com/services/YOUR/WEBHOOK/URL

Our Openclaw logging and debugging guide covers log aggregation and structured logging for multi-instance setups.

Putting It All Together

Here is the decision matrix for choosing your HA strategy based on your actual requirements:

| Requirement | Single + Auto-Restart | Multi-Instance + LB | Kubernetes |
|---|---|---|---|
| Recover from process crashes | Yes | Yes | Yes |
| Survive host failure | No | Yes | Yes |
| Zero-downtime updates | No | Yes (rolling) | Yes (rolling) |
| Auto-scaling | No | Manual | Yes (HPA) |
| Operational complexity | Low | Medium | High |
| Monthly cost | $5-15 (1 VPS) | $15-45 (3 VPS) | $50+ (cluster) |

Most teams should start with a single instance plus systemd or Docker auto-restart. That alone eliminates the majority of outages. When your uptime requirement exceeds 99.5%, or when you cannot tolerate 30-60 seconds of downtime during restarts, move to the multi-instance setup described in this guide.

For teams considering Kubernetes, our enterprise deployment guide covers the orchestration layer in detail.

Frequently Asked Questions

How many Openclaw instances do I need for high availability?

Three instances is the recommended minimum. This allows one instance to be down (failure or maintenance) while two continue handling traffic. Two instances provide some redundancy but leave you with a single point of failure during updates or instance failures.

Can I use Docker Compose for Openclaw HA instead of Kubernetes?

Docker Compose works for multi-instance setups on a single host, but it cannot span multiple servers or handle automatic failover across machines. For single-server redundancy (protecting against process crashes, not host failures), Docker Compose with multiple service replicas and a local Nginx load balancer is a practical and simpler alternative to Kubernetes.
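A sketch of that single-host layout (service names, the nginx.conf mount, and the replica count are assumptions, not Openclaw defaults). Scaled replicas cannot each publish the same host port, so only Nginx is exposed, and it reaches the replicas through Docker's internal DNS name openclaw:

```yaml
# docker-compose.yml (sketch): three gateway replicas behind one Nginx
services:
  openclaw:
    image: openclaw/openclaw:latest
    restart: unless-stopped
    env_file: .env
    deploy:
      replicas: 3        # no per-replica published ports
  nginx:
    image: nginx:stable
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro   # proxy_pass http://openclaw:18789
    depends_on:
      - openclaw
```

Two caveats: Nginx resolves the service name at startup, so reload it after scaling, and identical replicas all inherit the same .env, so the cron leader problem from earlier still needs a separate single-replica leader service or a distributed lock.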

How do I prevent duplicate cron jobs when running multiple instances?

Set OPENCLAW_SKIP_CRON=1 on all instances except one designated cron leader. The leader handles all scheduled tasks. If you need automatic leader failover, implement a Redis-based distributed lock where instances compete for a lock key before executing cron jobs.

What happens to active conversations when an Openclaw instance restarts?

If you use shared storage or a database-backed memory service, conversation context persists across restarts. The load balancer routes the next message to a healthy instance, which reads the shared memory and continues the conversation. Without shared state, the replacement instance starts with no context for in-progress conversations.

Does Openclaw support model failover automatically?

Yes. Openclaw’s built-in model failover rotates through configured auth profiles when one hits rate limits or errors, with cooldown intervals of 1, 5, 25, and 60 minutes. If all profiles for a model are exhausted, it falls through to the next model in your fallbacks configuration. This is separate from infrastructure-level HA and works even on a single instance. See our multi-model configuration guide for setup details.

What is the best load balancer for multiple Openclaw instances?

Nginx and Caddy are the most common choices. Nginx offers fine-grained control over upstream health checking, retry behavior, and connection limits. Caddy provides automatic HTTPS and simpler configuration. Both support passive health checks that remove failed instances from the pool. For cloud deployments, managed load balancers (AWS ALB, GCP Load Balancer) reduce operational overhead.

How much additional cost does an HA setup add?

A basic 3-instance setup on budget VPS providers runs $15-45 per month total (three 2GB RAM instances at $5-15 each). This is 3x the cost of a single instance, but for business-critical deployments the cost of an hour of downtime typically exceeds several months of infrastructure spend.

Key Takeaways

  • Start with automatic restarts (systemd or Docker restart policies) before adding more instances. This single change prevents most unplanned outages.
  • Three instances behind a load balancer is the minimum for true high availability that survives host failures.
  • Shared storage or a database-backed memory service is required for multi-instance deployments. Without it, conversations break across instance boundaries.
  • Designate one instance as the cron leader to prevent duplicate scheduled jobs.
  • Monitor every instance with health checks and route alerts to your team. HA infrastructure that fails silently is worse than a single instance you watch closely.

If your team needs Openclaw running with minimal downtime but setting up the infrastructure feels like more than you want to manage, SFAI Labs handles managed Openclaw deployments with built-in redundancy, monitoring, and support.

Last Updated: Apr 24, 2026


SFAI Labs

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
