nscale — Nomad Scale-to-Zero

Transparent scale-to-zero and wake-on-request for HashiCorp Nomad services. nscale sits between Traefik and your Nomad jobs — when traffic arrives for a dormant service, it wakes the job, proxies the request, and scales idle services back to zero when they go quiet.

Architecture

graph LR
    Client([Client]) -->|request| Traefik

    subgraph Traefik["Traefik (reverse proxy)"]
        CC["ConsulCatalog route<br/>priority 30"]
        Fallback["s2z-fallback route<br/>priority 1"]
    end

    CC -->|"service=s2z-nscale#64;file"| nscale
    Fallback -->|"service=s2z-nscale"| nscale

    nscale -->|proxy| Backend["Nomad job<br/>(healthy allocation)"]
    nscale -->|scale up| Nomad
    nscale -->|health poll| Consul
    nscale -->|activity + registry| Redis

    Nomad -->|registers service| Consul
    Consul -->|service discovery| Traefik

Routing paths

All traffic for nscale-managed services flows through nscale on both the cold and warm paths. This ensures nscale has full visibility into in-flight requests and can prevent premature scale-down.

graph TD
    Request([Incoming request]) --> Traefik

    Traefik -->|"Service UP<br/>CC route (prio 30)"| nscale
    Traefik -->|"Service DOWN<br/>Fallback (prio 1)"| nscale

    nscale -->|Service is running| Proxy["Proxy to backend"]
    nscale -->|Service is dormant| Wake["Wake → poll Consul → proxy"]

    Proxy --> Backend["Nomad allocation"]
    Wake -->|scale up| Nomad["Nomad API"]
    Nomad --> Consul["Consul health check"]
    Consul -->|healthy| Proxy

Key insight: ConsulCatalog routes use traefik.http.routers.<name>.service=s2z-nscale@file to point back to nscale instead of directly to the Nomad allocation. This guarantees nscale tracks every request, enabling in-flight protection and heartbeat-based activity recording.

In-flight request protection

nscale prevents scale-down of services with active long-running requests using an InFlightTracker with RAII guards and periodic heartbeats:

sequenceDiagram
    participant Client
    participant Traefik
    participant nscale
    participant Backend
    participant Scaler

    Client->>Traefik: GET /slow?delay=60
    Traefik->>nscale: route via s2z-nscale@file
    nscale->>nscale: InFlightTracker.track(job_id)
    nscale->>nscale: record_activity(start)
    nscale->>Backend: proxy request

    loop Every idle_timeout / 3
        nscale->>nscale: heartbeat → record_activity
    end

    Note over Scaler: Scale-down tick
    Scaler->>nscale: has_in_flight(job_id)?
    nscale-->>Scaler: true → skip scale-down

    Backend-->>nscale: response (after 60s)
    nscale->>nscale: InFlightGuard dropped
    nscale->>nscale: record_activity(end)
    nscale-->>Client: response

    Note over Scaler: Next tick after idle_timeout
    Scaler->>nscale: has_in_flight(job_id)?
    nscale-->>Scaler: false
    Scaler->>Scaler: check traffic probe
    Scaler->>nscale: scale to zero

Scale-down decision flow

flowchart TD
    Tick["Scale-down tick<br/>(every scale_down_interval)"] --> Lock["Acquire distributed lock"]
    Lock --> Idle["Query Redis for idle jobs<br/>(activity score < now − idle_timeout)"]
    Idle -->|No idle jobs| Done["Sleep until next tick"]
    Idle -->|Found idle jobs| Loop["For each idle job"]
    Loop --> Deferred{"Deferred due to<br/>active deployment?"}
    Deferred -->|Yes, backoff active| Skip["Skip job"]
    Deferred -->|No| InFlight{"InFlightTracker<br/>has_in_flight?"}
    InFlight -->|Yes| Refresh["Refresh activity<br/>→ skip"]
    InFlight -->|No| Probe{"Traefik traffic probe<br/>request delta > 0?"}
    Probe -->|Active traffic| Refresh
    Probe -->|Probe error| Skip
    Probe -->|No traffic| Scale["Scale job to count=0"]
    Scale --> Cleanup["Mark dormant<br/>Remove activity<br/>Clear traffic baseline"]
    Refresh --> Loop
    Skip --> Loop
    Cleanup --> Loop

nscale is a single Rust binary composed of seven internal crates:

Crate	Purpose
`nscale-core`	Shared types, config (Figment), traits, `InFlightTracker`
`nscale-nomad`	Nomad API client — scale up/down, event stream
`nscale-consul`	Consul catalog — health checks, service discovery
`nscale-store`	Redis activity store (sorted set) and job registry
`nscale-proxy`	Reverse proxy with in-flight tracking and heartbeat
`nscale-waker`	Wake coordinator — request coalescing, state machine
`nscale-scaler`	Scale-down controller with traffic probe + in-flight guard

Features

Wake-on-request — Dormant services are started automatically when traffic arrives
Request coalescing — Concurrent requests for the same service share a single wake cycle
In-flight protection — RAII-based InFlightTracker with heartbeat prevents scale-down during active requests
Heartbeat activity — Long-running requests refresh activity every idle_timeout / 3, surviving any idle window
Reverse proxy — All requests (cold and warm path) route through nscale for full visibility
Idle detection — Services with no recent activity are scaled to zero via Redis sorted set
Traffic probe — Scrapes Traefik Prometheus metrics as a secondary guard against scaling down active services
Retry with cache invalidation — On upstream failure, invalidates stale endpoints and retries the full wake cycle
Nomad event stream — Reacts to allocation lifecycle events for instant state transitions
Active-deployment tolerance — Gracefully handles Nomad 400 "scaling blocked due to active deployment"
Bounded concurrency — Configurable limit on simultaneous Nomad scale operations

Additional operator docs live in docs/, starting with the performance-configuration.md guide and the job-submission.md guide for the new admin submission flow.

Quick Start

Prerequisites

Docker and Docker Compose
Nomad 1.10+
Consul 1.18+

Run with Docker Compose

The integration stack brings up Nomad, Consul, Redis, Traefik, and nscale:

cd integration
bash traefik/certs/generate.sh
docker compose up -d

Submit a sample job through nscale:

# Build a safe JSON payload from the sample HCL fixture
curl -X POST http://localhost:9090/admin/jobs \
  -H 'Content-Type: application/json' \
  --data "$(jq -n \
    --rawfile hcl jobs/echo-submit.nomad \
    --arg variables $'service_name = \"echo-s2z\"\nhost_name = \"echo-s2z.localhost\"' \
    '{hcl: $hcl, variables: $variables}')"

Send a request — nscale wakes the service and proxies the response:

curl -H "Host: echo-s2z.localhost" http://localhost:80/

# HTTPS works too (Traefik terminates TLS; -k accepts the local self-signed cert)
curl -k --resolve 'echo-s2z.localhost:443:127.0.0.1' https://echo-s2z.localhost/

Build from source

cargo build --release
./target/release/nscale

Docker

docker build -t nscale .
docker run -p 8080:8080 -p 9090:9090 nscale

Configuration

nscale uses Figment for layered configuration: Environment variables > TOML file > Defaults.

TOML (`config/default.toml`)

[default]
listen_addr = "0.0.0.0:8080"
admin_addr  = "0.0.0.0:9090"

[default.nomad]
addr        = "http://localhost:4646"
concurrency = 50

[default.consul]
addr = "http://localhost:8500"

[default.redis]
url = "redis://localhost:6379"

[default.scaling]
idle_timeout_secs        = 300
wake_timeout_secs        = 60
scale_down_interval_secs = 30
min_scale_down_age_secs  = 120

[default.proxy]
request_timeout_secs = 30
request_buffer_size  = 1000

[default.routing]
file_provider_service = "s2z-nscale@file"

Environment variables

All settings can be overridden with NSCALE_ prefixed env vars. Nested keys use double underscores (Figment .split("__")).

Variable	Default	Description
`NSCALE_LISTEN_ADDR`	`0.0.0.0:8080`	Proxy listen address
`NSCALE_ADMIN_ADDR`	`0.0.0.0:9090`	Admin/health listen address
`NSCALE_NOMAD__ADDR`	`http://localhost:4646`	Nomad API address
`NSCALE_NOMAD__TOKEN`	—	Nomad ACL token (optional)
`NSCALE_NOMAD__CONCURRENCY`	`50`	Max concurrent Nomad operations
`NSCALE_CONSUL__ADDR`	`http://localhost:8500`	Consul API address
`NSCALE_CONSUL__TOKEN`	—	Consul ACL token (optional)
`NSCALE_REDIS__URL`	`redis://localhost:6379`	Redis connection URL
`NSCALE_SCALING__IDLE_TIMEOUT_SECS`	`300`	Seconds before idle service is scaled down
`NSCALE_SCALING__WAKE_TIMEOUT_SECS`	`60`	Max seconds to wait for a service to become healthy
`NSCALE_SCALING__SCALE_DOWN_INTERVAL_SECS`	`30`	Scale-down sweep interval
`NSCALE_SCALING__MIN_SCALE_DOWN_AGE_SECS`	`120`	Minimum job age before scale-down eligibility (reserved for future runtime wiring)
`NSCALE_PROXY__REQUEST_TIMEOUT_SECS`	`30`	Upstream request timeout
`NSCALE_PROXY__REQUEST_BUFFER_SIZE`	`1000`	Request wake buffer size (reserved for future runtime wiring)
`NSCALE_ROUTING__FILE_PROVIDER_SERVICE`	`s2z-nscale@file`	Traefik service target injected into router tags during `/admin/jobs` submissions
`NSCALE_TRAEFIK__METRICS_URL`	—	Traefik Prometheus endpoint (enables traffic probe)
`NSCALE_TRAEFIK__PROVIDER`	—	Traefik provider name for metric labels
`RUST_LOG`	`info,nscale=debug`	Tracing filter

Traefik Integration

nscale requires specific Traefik configuration to ensure all traffic flows through it.

Static config (`traefik.yml`)

providers:
  file:
    filename: /etc/traefik/dynamic.yml
  consulCatalog:
    exposedByDefault: false
    strictChecks:
      - "passing"
      - "warning"

Important: strictChecks must be a list of health status strings, not a boolean. Setting strictChecks: true is silently interpreted as the literal string "true", causing all ConsulCatalog services to be rejected.

Dynamic config (`dynamic.yml`)

http:
  routers:
    s2z-fallback:
      rule: "HostRegexp(`^[a-z0-9-]+\\.localhost$`)"
      priority: 1
      entryPoints: [http]
      service: s2z-nscale

    s2z-fallback-https:
      rule: "HostRegexp(`^[a-z0-9-]+\\.localhost$`)"
      priority: 1
      entryPoints: [https]
      tls: {}
      service: s2z-nscale

  services:
    s2z-nscale:
      loadBalancer:
        passHostHeader: true
        servers:
          - url: "http://nscale:8080"

tls:
  certificates:
    - certFile: /etc/traefik/certs/server.crt
      keyFile: /etc/traefik/certs/server.key

TLS / HTTPS

nscale itself still speaks plain HTTP on the inside. HTTPS support is provided by Traefik TLS termination:

clients connect to Traefik on :443
Traefik terminates TLS and forwards to http://nscale:8080
nscale still proxies to plain-HTTP Nomad allocations unless your deployment adds a separate upstream TLS layer

For cold-start over HTTPS to work, every TLS entrypoint must also have a fallback file-provider router pointing at s2z-nscale. Otherwise warm traffic may work while cold HTTPS requests return Traefik 404 before nscale can wake the job.

The integration stack ships a local self-signed certificate for localhost / *.localhost via integration/traefik/certs/generate.sh. Production deployments should replace that with their own certificate or ACME resolver setup.

Nomad job tags

Services must route through nscale on both cold and warm paths. If you submit jobs directly to Nomad, include service=s2z-nscale@file yourself. If you submit through /admin/jobs, nscale injects or overrides this tag automatically for every Traefik-enabled service that already declares explicit router tags.

Use service=s2z-nscale@file to point the ConsulCatalog router at nscale:

service {
  name     = "my-service"
  provider = "consul"
  port     = "http"

  tags = [
    "traefik.enable=true",
    "traefik.http.routers.my-service.rule=Host(`my-service.localhost`)",
    "traefik.http.routers.my-service.entryPoints=http,https",
    "traefik.http.routers.my-service.tls=true",
    "traefik.http.routers.my-service.service=s2z-nscale@file",
  ]

  check {
    type     = "http"
    path     = "/"
    interval = "2s"
    timeout  = "1s"
  }
}

This creates two routes to nscale:

Route	Provider	Priority	When active
`my-service@consulcatalog`	ConsulCatalog	30	Service is running (healthy in Consul)
`s2z-fallback@file`	File	1	Always (catches dormant services)

Job submission via `/admin/jobs`

The admin submission endpoint lets nscale own the full registration flow:

parse Nomad HCL with optional variables through Nomad's parser
inject or override traefik.http.routers.<name>.service=s2z-nscale@file
submit the mutated job to Nomad
auto-register every managed service in Redis
seed initial activity so the scaler can safely discover the job later

The endpoint only manages services that:

set traefik.enable=true
include at least one explicit router tag like traefik.http.routers.api.rule=...

Router TLS tags such as traefik.http.routers.api.entryPoints=http,https and traefik.http.routers.api.tls=true are preserved exactly as submitted. nscale only injects or overrides the .service=s2z-nscale@file target.

Non-Traefik services are ignored. Traefik-enabled services without explicit router tags are rejected, because nscale has no router name to target for the injected .service= override.

Admin API

Method	Path	Description
`GET`	`/healthz`	Liveness check
`GET`	`/readyz`	Readiness check (verifies Redis)
`POST`	`/admin/jobs`	Parse HCL, inject required Traefik routing tags, submit to Nomad, auto-register managed services
`POST`	`/admin/registry`	Register a single job (seeds activity)
`POST`	`/admin/registry/sync`	Bulk-sync all job registrations (seeds activity)

Job submission payload

{
  "hcl": "job \"echo-submit-job\" { ... }",
  "variables": "service_name = \"echo-s2z\"\nhost_name = \"echo-s2z.localhost\""
}

variables is optional. When present, it is passed straight through to Nomad's HCL parser. Keep variable interpolation inside attribute values — Nomad does not allow template expressions inside block labels such as job "${var.name}".

Successful responses include the Nomad evaluation information plus the set of managed services that nscale registered automatically.

Manual registration payload

{
  "job_id": "my-service",
  "service_name": "my-service",
  "nomad_group": "main"
}

Registration seeds an activity timestamp in Redis so the scaler can detect the job as idle once idle_timeout expires. Without this, registered jobs would never be discovered by the scale-down controller.

This endpoint is still useful when jobs are submitted outside nscale and you only need to register an already-known service.

Testing

Unit tests

cargo nextest run --workspace

Integration / stress tests (k6)

cd integration

# Multi-service chaos (50 services + random kills)
docker compose --profile stress run --rm \
  -e NSCALE_JOB_COUNT=50 \
  -e NSCALE_DURATION=90s \
  k6 run /scripts/multi-service.js

Admin submission flow

cd integration
./test.sh

The main integration script now submits jobs/echo-submit.nomad through /admin/jobs, verifies that nscale injected the Traefik service override tag in Consul, confirms lookup still works when service_name differs from job_id, and then exercises wake-up and automatic scale-down.

Multi-job management

cd integration

# Submit and register 50 jobs
bash scripts/multi-job.sh submit 50

# Check status
bash scripts/multi-job.sh status 50

# Teardown
bash scripts/multi-job.sh teardown 50

How It Works

1. Submission and registration

A Nomad job can be submitted through /admin/jobs. nscale parses the HCL via Nomad, injects the Traefik router service override, submits the mutated job, stores the managed services in Redis, and seeds an initial activity timestamp. The job can be at any count — nscale will scale it down once idle_timeout expires if there's no traffic.

If you submit jobs outside nscale, use /admin/registry or /admin/registry/sync to add the same registration data manually.

2. Cold-start wake

When all allocations are stopped, the ConsulCatalog route disappears. Traefik falls through to the s2z-fallback route (priority 1) → nscale. nscale looks up the job in its registry, calls Nomad to scale up, polls Consul for a healthy endpoint, then proxies the original request.

3. Warm-path proxy

When the service is healthy, Traefik creates a ConsulCatalog route (priority 30) that also points to s2z-nscale@file. The request still flows through nscale, which tracks it with InFlightTracker, spawns a heartbeat, and proxies to the real backend using the coordinator's cached endpoint (resolved from Consul during wake).

4. In-flight protection

Each proxied request increments an atomic counter via InFlightTracker.track(), returning an RAII guard that decrements on drop. A background heartbeat task refreshes activity in Redis every idle_timeout / 3. The scale-down controller checks has_in_flight() before any scale operation — if requests are active, it refreshes activity and skips.

5. Scale down

The scale-down controller runs a periodic sweep:

Acquire distributed lock in Redis
Query the activity sorted set for jobs with score < now - idle_timeout
For each idle job, check deployment deferral (skip if Nomad reported active deployment recently)
Check InFlightTracker (in-flight requests block scale-down and refresh activity)
Check Traefik traffic probe (request counter delta; fail-open on probe errors)
If truly idle, scale Nomad job to count = 0
Mark dormant, remove activity, and clear the traffic probe baseline

6. Nomad event stream

nscale subscribes to Nomad's allocation event stream. When allocations transition to running, it records activity. When they stop, it marks the job dormant and clears the coordinator cache — enabling instant state transitions without polling.

Project Structure

├── Cargo.toml              # Workspace + binary definition
├── docs/                   # Operator-facing guides and configuration docs
├── src/main.rs             # Binary entrypoint
├── crates/
│   ├── nscale-core/        # Config, traits, shared types
│   ├── nscale-nomad/       # Nomad API client
│   ├── nscale-consul/      # Consul catalog client
│   ├── nscale-store/       # Redis activity store + job registry
│   ├── nscale-proxy/       # Reverse proxy + activity middleware
│   ├── nscale-waker/       # Wake coordinator (state machine)
│   └── nscale-scaler/      # Scale-down controller + traffic probe
├── config/
│   └── default.toml        # Default configuration
├── integration/
│   ├── docker-compose.yml  # Full local stack
│   ├── jobs/               # Sample Nomad job specs
│   ├── k6/                 # Stress & chaos test scripts
│   ├── scripts/            # Helper scripts
│   └── traefik/            # Traefik configuration
├── Dockerfile              # Multi-stage production build
├── Dockerfile.release      # Lean release image using pre-built binaries
├── CHANGELOG.md
├── CONTRIBUTING.md
└── LICENSE                 # Apache 2.0

License

Apache 2.0 — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
.github/workflows		.github/workflows
config		config
crates		crates
docs		docs
integration		integration
src		src
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Dockerfile		Dockerfile
Dockerfile.release		Dockerfile.release
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

nscale — Nomad Scale-to-Zero

Architecture

Routing paths

In-flight request protection

Scale-down decision flow

Features

Quick Start

Prerequisites

Run with Docker Compose

Build from source

Docker

Configuration

TOML (config/default.toml)

Environment variables

Traefik Integration

Static config (traefik.yml)

Dynamic config (dynamic.yml)

TLS / HTTPS

Nomad job tags

Job submission via /admin/jobs

Admin API

Job submission payload

Manual registration payload

Testing

Unit tests

Integration / stress tests (k6)

Admin submission flow

Multi-job management

How It Works

1. Submission and registration

2. Cold-start wake

3. Warm-path proxy

4. In-flight protection

5. Scale down

6. Nomad event stream

Project Structure

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors 2

Languages

TOML (`config/default.toml`)

Static config (`traefik.yml`)

Dynamic config (`dynamic.yml`)

Job submission via `/admin/jobs`

Packages