George Prodromou

Principal SEO & Growth Leader


Running 500 tenant apps on one Nomad cluster

Why G7Cloud uses Nomad + Consul + Traefik over Kubernetes for a multi-tenant hosting platform, and how the control plane / worker split actually works.

2026-02-14 · 10 min read

G7Cloud is a managed hosting platform. Customers ship WordPress, Next.js, Payload, Strapi, Ghost, Medusa, or raw PHP and Node apps, and we run each one with a dedicated container, a dedicated database, SSL, backups, and edge security, at a fixed monthly price. Today the platform runs hundreds of tenant applications with 99.99% uptime on infrastructure I architect and operate personally.

The stack is deliberately boring: Nomad, Consul, Traefik, Docker, on two VMs behind an edge router. This post is about why, and how the pieces fit.

Why not Kubernetes

Kubernetes is the default answer for multi-tenant workloads in 2026 and it's a perfectly good answer — for teams. For a platform I operate personally, the operational surface area is the enemy. The cluster itself shouldn't be a full-time job.

Nomad + Consul gets you 80% of what Kubernetes gets you with maybe 20% of the moving parts: one binary per role, a readable HCL job spec, first-class support for non-container workloads, and a control plane that boots from cold in seconds. For this shape of workload — many small, long-running apps with per-tenant isolation — the weight of Kubernetes buys features we don't use.
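To give a sense of what "a readable HCL job spec" means in practice, here is a minimal sketch of what a tenant job could look like. The job name, image, and datacenter are invented for illustration, not G7Cloud's actual config:

```hcl
# Hypothetical tenant job — a sketch of the shape of a Nomad HCL spec,
# not the platform's real configuration.
job "tenant-acme-blog" {
  datacenters = ["dc1"]
  type        = "service"   # long-running app, restarted if it exits

  group "app" {
    count = 1               # one dedicated container per tenant app

    task "wordpress" {
      driver = "docker"

      config {
        image = "wordpress:6-php8.2-apache"
      }
    }
  }
}
```

The whole spec fits on a screen, which is most of the argument: when a tenant misbehaves, the entire description of what should be running is one file of plain HCL.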

The two-VM split

The platform runs across two machines plus a router. Each has one job.

Control plane

  • Consul server (cluster leader)
  • Nomad server
  • Docker daemon
  • Platform Traefik — TLS termination, routing, ACME
  • Authoritative DNS with an admin UI
  • Mail server
  • Control API, image builds, private registry
  • Operator dashboard

All of it systemd-managed, with a deliberate boot order: Docker first, then the platform Traefik (because it creates the external proxy network everything else attaches to), then everything else.

Tenant worker

  • Consul client, joining the control-plane server
  • Nomad client
  • Docker daemon
  • Worker Traefik, reading the Consul catalog
  • Shared MariaDB for tenants
  • Shared SFTP

Every tenant container is scheduled by Nomad onto the worker. The worker Traefik points at its local Consul agent, not at the control plane — so the control plane isn't a single point of failure for request routing.
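The glue between Nomad and the worker Traefik is Consul service registration. A sketch of how a tenant task could register itself so Traefik's Consul Catalog provider picks it up — the service name, router name, and hostname are invented; the tag format follows Traefik's documented label convention:

```hcl
# Inside a tenant task or group: register in the local Consul catalog with
# tags the worker Traefik reads. Names and hostname are illustrative.
service {
  name = "tenant-acme-blog"
  port = "http"

  tags = [
    "traefik.enable=true",
    "traefik.http.routers.acme-blog.rule=Host(`acme-blog.example.com`)",
  ]

  check {
    type     = "http"
    path     = "/"
    interval = "10s"
    timeout  = "2s"
  }
}
```

Because the worker Traefik watches its local Consul agent, a tenant app that passes its health check starts receiving traffic without the control plane being involved at all.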

Networking

Nomad uses CNI bridge networking. Dynamic ports in the 20000–32000 range are DNAT'd via iptables so the worker Traefik can reach any allocation on its dynamic port. CNI plugins live at /opt/cni/bin. Consul Connect is available but we don't need service mesh complexity for this shape of workload — direct routing is fine.
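The dynamic-port setup above corresponds to a bridge-mode network stanza like the following sketch (the container port 8080 is illustrative). Nomad picks a free host port from its dynamic range — 20000–32000 by default — and the CNI bridge plugin installs the DNAT rule mapping it to the container:

```hcl
# Group-level network block: bridge mode with a dynamically allocated
# host port forwarded to the container. Port number is an assumption.
group "app" {
  network {
    mode = "bridge"

    port "http" {
      to = 8080   # container port; the host side is chosen dynamically
    }
  }
}
```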

One gotcha worth calling out: after a reboot, Nomad allocations occasionally hold stale iptables rules that never get cleaned up. If something mysteriously stops routing, nomad alloc stop <id> forces a fresh schedule and the rules regenerate cleanly.

Deploys

The deploy dance for a control-plane service is always the same:

docker compose down
git pull
docker compose build --no-cache
docker compose up -d

Skipping the down step is the single most common way to end up with stale containers running against new code. It's baked into muscle memory now.

Why this holds up at scale

Three properties make this design work at hundreds of tenants:

  • Local-first control. Worker Traefik talks to the local Consul agent, tenant databases are on the worker, shared SFTP is on the worker. The control plane can be rebooted without tenants noticing.
  • Boring primitives. Nomad, Consul, Traefik and Docker all compose with systemd and a plain HCL file. When something goes wrong there's nothing magic to unpick.
  • Per-tenant container isolation. Each tenant app gets its own container, its own resource limits, and its own lifecycle. No shared PHP-FPM pools, no "noisy neighbour" surprises.
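The per-tenant limits in the last point live in the task's resources block. A sketch with illustrative numbers — note that memory_max only takes effect if memory oversubscription is enabled on the cluster, and the restart policy shown is an assumption, not the platform's actual tuning:

```hcl
# Hypothetical per-tenant limits and lifecycle. Numbers are illustrative.
task "app" {
  driver = "docker"

  config {
    image = "example/tenant-app:latest"   # placeholder image
  }

  resources {
    cpu        = 500   # MHz share
    memory     = 256   # MB guaranteed
    memory_max = 512   # MB burst ceiling (needs oversubscription enabled)
  }

  restart {
    attempts = 3
    interval = "5m"
    delay    = "15s"
    mode     = "fail"
  }
}
```

A tenant that leaks memory hits its own ceiling and gets restarted on its own schedule; the neighbours never see it.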

What I'd change

Two honest caveats.

  • The second worker is overdue. Scaling tenant workers horizontally is straightforward with Nomad, but it's still a planned milestone rather than shipped.
  • The central Traefik vs. worker Traefik split is pragmatic, not elegant. Longer-term there's a cleaner design where public traffic lands on a dedicated edge tier and platform Traefik is purely internal.

None of this is novel tech — the interesting part is the restraint: picking primitives you can operate alone, and refusing to introduce anything you can't debug at 3 a.m.