RTMS on AWS: Terraform Template

Deploy a scalable Zoom RealTime Media Streams (RTMS) consumer to AWS in one command. Same shape as the rtms-quickstart-py repo — @rtms.on_webhook_event + rtms.run() packaged into a Fargate container, fronted by an ALB, and configured to auto-scale.

Architecture

Zoom ──webhook──▶  ALB (HTTPS, ACM cert)  ──▶  ECS Fargate Service
                                                ├── Task 1  (worker/main.py)
                                                ├── Task 2  (worker/main.py)
                                                └── Task N  (worker/main.py)
                                                            │
                                              ┌─────────────┼──────────────┐
                                              ▼             ▼              ▼
                                          S3 (JSONL)  CloudWatch     Secrets Manager

Each task runs worker/main.py, which uses @rtms.on_webhook_event to receive Zoom webhooks on port 8080 and rtms.EventLoopPool(threads=2) to host the active meetings. The ECS service auto-scales on CPU.

Prerequisites

Tools (one-time install):

brew install awscli terraform jq            # macOS
# Linux: use your package manager equivalents

Then aws configure (or aws sso login) once so the CLI is authenticated. No Docker required — the default worker_image pulls a pre-built public image.

Accounts / external assets:

AWS account with admin access. us-east-1 is the default deploy region.
A domain you control (Route 53 preferred, but any provider works — see DNS modes below).
Zoom Marketplace RTMS app with Client ID, Client Secret, and Webhook Secret Token. Events subscribed: endpoint.url_validation, meeting.rtms_started, meeting.rtms_stopped, meeting.rtms_interrupted.

Quick Start

git clone https://github.com/zoom/rtms-terraform-aws.git
cd rtms-terraform-aws
./deploy.sh

That's it. The script prompts for the values it needs, creates everything (Secrets Manager, Terraform state, infrastructure), and prints your webhook URL when it's done. End-to-end: ~10 minutes.

Re-running ./deploy.sh is safe — it preserves your terraform.tfvars, skips Secrets Manager entries and state buckets that already exist, and only terraform applys if there's drift.

What you'll be prompted for

Field	What
AWS region	Defaults to `us-east-1` (cheapest)
Project name	Resource prefix; defaults to `rtms-demo`
Webhook subdomain	e.g. `rtms.example.com` — the FQDN Zoom posts to
DNS mode	`route53` (recommended, fully automated) or `external` (you handle DNS)
Route 53 zone ID or ACM cert ARN	Depending on DNS mode (see below)
Zoom Client ID	From your Marketplace RTMS app
Zoom Client Secret	Same
Zoom Webhook Secret Token	Same
Budget alert email	For the AWS Budgets 80% notification

Prefer to run each step manually?

See docs/MANUAL_SETUP.md for the step-by-step walkthrough that deploy.sh automates.

DNS modes

Two ways to handle DNS for your webhook subdomain. deploy.sh asks at startup; you can also flip via the dns_mode tfvar.

Mode	When to use	What you provide	What Terraform creates
`route53` (default, recommended)	Your domain is hosted in Route 53	`route53_zone_id` + `webhook_domain`	ACM cert, DNS-validation CNAME, ALB ALIAS — fully automated, no manual DNS clicks
`external`	Your DNS is at Squarespace, Cloudflare, GoDaddy, etc.	`acm_certificate_arn` (you pre-issue + DNS-validate the cert) + `webhook_domain`	Nothing DNS-related. You manually point your DNS at the `alb_dns_name` output after apply.

How to find your Route 53 zone ID

aws route53 list-hosted-zones \
  --query "HostedZones[?Name=='example.com.'].Id" \
  --output text
# returns: /hostedzone/Z01234567ABCDEFGH  (deploy.sh strips the prefix)

External mode setup

Request the cert in the same region you'll deploy to:

aws acm request-certificate \
  --domain-name rtms.example.com \
  --validation-method DNS \
  --region us-east-1

AWS returns a CNAME record to add at your DNS provider. Add it.
Poll until Certificate.Status is "ISSUED".
Pass the cert ARN into deploy.sh when it prompts.
After terraform apply, create a CNAME at your DNS provider pointing your subdomain at the alb_dns_name output.

What's deployed

Resource	Purpose
VPC (2 AZs)	Networking foundation. Default profile uses public subnets for cost; flip `use_private_subnets = true` for production.
ALB	Public HTTPS endpoint. Terminates TLS via your ACM cert. Health check: `GET /` matcher `200-299` (worker returns 200 to GET via SDK monkey-patch — see DEVS-X9 v1.2 work).
ACM cert + Route 53 records (route53 mode only)	Issued, DNS-validated, and ALIAS-linked to the ALB automatically.
ECS cluster + Fargate service	Runs the worker container. Min 1 task, max 20 by default.
Auto-scaling policy	Target tracking on CPU at 50%.
S3 bucket	Transcript JSONL at `transcripts/<meeting_uuid>/<ns-epoch>.jsonl` (one PUT per chunk). SSE-S3, versioned, 30-day lifecycle to Glacier IR.
CloudWatch log group	`/aws/ecs/<project>-worker`, 30-day retention.
CloudWatch alarms	ALB 5xx > 5/min, no healthy hosts, ECS service unhealthy.
CloudWatch dashboard	ALB requests + 5xx, ECS CPU + memory, ALB healthy host count.
AWS Budgets alert	80% + 100% of `monthly_budget_usd` → email.
Secrets Manager references	Three pre-existing secrets injected into the task as env vars.

Outputs

After deploy.sh (or terraform apply) finishes:

Output	Use
`webhook_url`	Paste into Zoom Marketplace → Event Subscriptions
`alb_dns_name`	Point your DNS record here (external mode only)
`transcript_bucket`	Where `.jsonl` transcripts land
`ecs_cluster_name` / `ecs_service_name`	For `aws ecs` debugging
`log_group_name`	CloudWatch Logs Insights queries

Cost

State	Approx. monthly
Idle (0 meetings)	~~$20 — ALB (~~$16) + 1 Fargate Spot task (~~$2) + misc (~~$2)
10 concurrent × 1 hr/day	~$22
100 concurrent × 1 hr/day	~$30

A monthly_budget_usd tfvar wires up an AWS Budgets alert at 80%. Run bash scripts/teardown.sh when not testing — the ALB is the only resource that bills meaningfully when idle.

Low-cost test recipe

Fresh AWS account in us-east-1
max_capacity = 2, task_cpu = 256, task_memory = 512
terraform destroy immediately after each session
Realistic budget: ~$3 for a full end-to-end test pass

Local Development

cd worker
python3 -m venv .venv && source .venv/bin/activate   # or: uv venv && source .venv/bin/activate
pip install -r requirements-dev.txt                  # or: uv pip install -r requirements-dev.txt
pytest                                               # 48 tests, ~1 second

The same worker/main.py runs locally. Copy .env.example to .env and fill in your credentials. The file is annotated with # DEV: <value> comments next to the lines that should be flipped for local development — progressive SDK logs, DEBUG Python logs, on-disk transcripts, smaller pools, faster reconnect. Leave the production-shape values as-is when you want a CloudWatch-bound test run. .env is gitignored. In Fargate no .env file exists; secrets come from Secrets Manager.

.env files are local-only. In production (Fargate, EC2, Kubernetes, anywhere customer-facing), secrets must come from AWS Secrets Manager via the ECS task definition's secrets block — never baked into a Docker image, never copied onto a server, never committed to git. The Terraform in this repo wires Secrets Manager → ECS automatically; deploy.sh handles populating Secrets Manager from your input (or detects existing entries by name and re-uses them).

Point an ngrok tunnel at port 8080 to receive Zoom webhooks locally.

Test Plan

# Webhook URL validation (CRC handshake — no real meeting needed)
bash scripts/send-test-webhook.sh validation
# expect: HTTP 200 with {"plainToken": ..., "encryptedToken": ...}

# HMAC signature rejection
bash scripts/send-test-webhook.sh invalid
# expect: HTTP 401; check the WebhookSignatureFailed CloudWatch metric

# Real RTMS smoke — start a Zoom meeting with RTMS on, then:
aws logs tail /aws/ecs/rtms-demo-worker --follow
aws s3 ls s3://<transcript_bucket>/transcripts/ --recursive

# Autoscale check — synth load via a small loop of webhooks
bash scripts/send-test-webhook.sh burst 20
aws ecs describe-services --cluster rtms-demo-cluster --services rtms-demo-worker \
  --query 'services[0].desiredCount'

# Failover — stop a running task, watch ECS replace it within ~60s
aws ecs list-tasks --cluster rtms-demo-cluster
aws ecs stop-task --cluster rtms-demo-cluster --task <task-arn>

Teardown

bash scripts/teardown.sh

Force-empties the S3 bucket, runs terraform destroy, and verifies no orphaned ENIs remain.

Failure Modes & Trade-offs (v1)

This is a POC reference, not a production-hardened deployment. Key trade-offs you should know about:

What the worker handles (RTMS-level reconnection)

Per Zoom's RTMS reconnection contract, the worker handles all three failover scenarios using the SDK's events (not raw WebSockets):

Scenario	Trigger	Worker action
1. RTMS server failure	`meeting.rtms_started` arrives with same `rtms_stream_id` but different `server_urls`	Leave old client, join with new payload
2. Signal connection down	`meeting.rtms_interrupted` webhook	Leave old client, join with the webhook's fresh credentials
3. Media connection down only	SDK `on_media_connection_interrupted` callback	Re-issue `client.join(stored_payload)` with exponential backoff (3s/6s/12s/24s/30s) up to 5 attempts

Reference: zoom/rtms-samples/rtms_api/reconnection_and_chaos_mode_js — same contract, raw-WebSocket implementation. Our worker does it through SDK callbacks.

What AWS handles (infrastructure-level failover)

Task crash → ECS replaces the task within ~60s. In-flight meetings on the dead task are dropped (webhook is already ack'd, no replay). Future webhooks route to healthy tasks via the ALB.
AZ outage → ECS places replacement tasks in the healthy AZ.
Spot interruption → SIGTERM handler calls client.leave() on all active meetings inside the 30s Fargate stop grace; future webhooks land elsewhere.

Other v1 demo shortcuts

Public-subnet workers. Saves ~$32/mo NAT GW cost. Security group locks ingress to the ALB. For private subnets + NAT, set use_private_subnets = true.
Fargate Spot. ~70% cheaper, but tasks can be reclaimed with 2-minute notice. Mix Spot + on-demand via spot_weight / ondemand_weight for a stable baseline.
One S3 PUT per transcript chunk. Simple to reason about, but produces lots of small objects. Production fix: buffer ~30s in memory and PUT a multi-line JSONL, or stream to Kinesis Firehose.
All meeting state lives in worker memory. This is the most important trade-off to understand. The worker keeps three in-process maps for each active stream — the rtms.Client instances (clients), the cached webhook payloads used for reconnection (stream_payloads), and per-stream reconnect-attempt counters (reconnect_attempt). Plus the in-process webhook dedup table. All of these are lost on task restart. If a Fargate task crashes mid-meeting (Spot reclaim, OOM, AZ blip), the in-flight meeting drops — the next webhook for that stream lands on a fresh task with no record of the previous join, and Zoom's join signature is time-bound so re-joining isn't always possible. Production fix: SQS-replay architecture where webhook ingest is decoupled from the worker fleet (see spec.md → Future Work). Cross-task dedup would similarly need DynamoDB / ElastiCache — but the rtms SDK rejects duplicate joins anyway, so per-task dedup is belt-and-suspenders.

Zoom for Government

Set zoom_host = "https://zoomgov.com" in terraform.tfvars. The worker maps URLs accordingly. ALB / ECS / DNS are unchanged.

Troubleshooting

Zoom webhook validation fails: ZM_RTMS_WEBHOOK_SECRET in Secrets Manager must match the Marketplace app's Secret Token exactly. Tail logs: aws logs tail /aws/ecs/<project>-worker --follow.
ALB returns 502 / 503: target group health check failing. The ALB hits GET / on port 8080; matcher is 200-299 (the worker subclasses WebhookHandler to return 200 to GET — without that, the SDK returns 501 which can't match within ALB's allowed 200-499 range).
Worker can't pull image: task execution role missing ECR read, or the image doesn't exist at the URI in worker_image. Check the ECR repo + image tag.
Transcripts not appearing: on_join_confirm not firing. Usually a Zoom-side scope or RTMS-enablement issue; check the Marketplace app config and confirm meeting.rtms_started, meeting.rtms_stopped, meeting.rtms_interrupted are subscribed events.
failed to open file [/app/logs/...] errno = 2 in logs: rebuild the worker image — image tag 0.1.3 or later includes mkdir -p /app/logs in the Dockerfile. The C++ RTMS SDK writes internal log files to ./logs/ relative to CWD; the container's WORKDIR /app resolves that to /app/logs/ which must exist.
terraform destroy hangs on VPC: run bash scripts/teardown.sh — it force-deletes Fargate ENIs.

Layout

rtms-terraform-aws/
├── deploy.sh                # one-command guided setup
├── README.md                # this file
├── docs/
│   └── MANUAL_SETUP.md      # step-by-step walkthrough (what deploy.sh automates)
├── terraform.tfvars.example # all inputs documented
├── .env.example             # worker env-var contract (prod-shape with DEV: overrides)
├── main.tf / variables.tf / outputs.tf / versions.tf
├── modules/                 # network, alb, worker, storage, observability
├── worker/                  # Python RTMS consumer (Dockerfile + main.py + tests)
├── scripts/
│   ├── bootstrap-state.sh         # S3 backend setup
│   ├── build-and-push-worker.sh   # build + push worker image to ECR
│   ├── send-test-webhook.sh       # signed-webhook smoke harness
│   ├── teardown.sh                # destroy + ENI cleanup
│   └── validate-terraform.sh      # fmt + validate + tflint + checkov
└── .github/                 # dependabot + CI workflows

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RTMS on AWS: Terraform Template

Architecture

Prerequisites

Quick Start

What you'll be prompted for

Prefer to run each step manually?

DNS modes

How to find your Route 53 zone ID

External mode setup

What's deployed

Outputs

Cost

Low-cost test recipe

Local Development

Test Plan

Teardown

Failure Modes & Trade-offs (v1)

What the worker handles (RTMS-level reconnection)

What AWS handles (infrastructure-level failover)

Other v1 demo shortcuts

Zoom for Government

Troubleshooting

Layout

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.github		.github
docs		docs
modules		modules
scripts		scripts
worker		worker
.env.example		.env.example
.gitignore		.gitignore
.terraform.lock.hcl		.terraform.lock.hcl
LICENSE.md		LICENSE.md
README.md		README.md
deploy.sh		deploy.sh
main.tf		main.tf
outputs.tf		outputs.tf
terraform.tfvars.example		terraform.tfvars.example
variables.tf		variables.tf
versions.tf		versions.tf

Folders and files

Latest commit

History

Repository files navigation

RTMS on AWS: Terraform Template

Architecture

Prerequisites

Quick Start

What you'll be prompted for

Prefer to run each step manually?

DNS modes

How to find your Route 53 zone ID

External mode setup

What's deployed

Outputs

Cost

Low-cost test recipe

Local Development

Test Plan

Teardown

Failure Modes & Trade-offs (v1)

What the worker handles (RTMS-level reconnection)

What AWS handles (infrastructure-level failover)

Other v1 demo shortcuts

Zoom for Government

Troubleshooting

Layout

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages