Deploy a scalable Zoom RealTime Media Streams (RTMS) consumer to AWS in one command. Same shape as the rtms-quickstart-py repo — @rtms.on_webhook_event + rtms.run() packaged into a Fargate container, fronted by an ALB, and configured to auto-scale.
Zoom ──webhook──▶ ALB (HTTPS, ACM cert) ──▶ ECS Fargate Service
├── Task 1 (worker/main.py)
├── Task 2 (worker/main.py)
└── Task N (worker/main.py)
│
┌─────────────┼──────────────┐
▼ ▼ ▼
S3 (JSONL) CloudWatch Secrets Manager
Each task runs worker/main.py, which uses @rtms.on_webhook_event to receive Zoom webhooks on port 8080 and rtms.EventLoopPool(threads=2) to host the active meetings. The ECS service auto-scales on CPU.
Tools (one-time install):
brew install awscli terraform jq # macOS
# Linux: use your package manager equivalentsThen aws configure (or aws sso login) once so the CLI is authenticated. No Docker required — the default worker_image pulls a pre-built public image.
Accounts / external assets:
- AWS account with admin access.
us-east-1is the default deploy region. - A domain you control (Route 53 preferred, but any provider works — see DNS modes below).
- Zoom Marketplace RTMS app with Client ID, Client Secret, and Webhook Secret Token. Events subscribed:
endpoint.url_validation,meeting.rtms_started,meeting.rtms_stopped,meeting.rtms_interrupted.
git clone https://github.com/zoom/rtms-terraform-aws.git
cd rtms-terraform-aws
./deploy.shThat's it. The script prompts for the values it needs, creates everything (Secrets Manager, Terraform state, infrastructure), and prints your webhook URL when it's done. End-to-end: ~10 minutes.
Re-running ./deploy.sh is safe — it preserves your terraform.tfvars, skips Secrets Manager entries and state buckets that already exist, and only terraform applys if there's drift.
| Field | What |
|---|---|
| AWS region | Defaults to us-east-1 (cheapest) |
| Project name | Resource prefix; defaults to rtms-demo |
| Webhook subdomain | e.g. rtms.example.com — the FQDN Zoom posts to |
| DNS mode | route53 (recommended, fully automated) or external (you handle DNS) |
| Route 53 zone ID or ACM cert ARN | Depending on DNS mode (see below) |
| Zoom Client ID | From your Marketplace RTMS app |
| Zoom Client Secret | Same |
| Zoom Webhook Secret Token | Same |
| Budget alert email | For the AWS Budgets 80% notification |
See docs/MANUAL_SETUP.md for the step-by-step walkthrough that deploy.sh automates.
Two ways to handle DNS for your webhook subdomain. deploy.sh asks at startup; you can also flip via the dns_mode tfvar.
| Mode | When to use | What you provide | What Terraform creates |
|---|---|---|---|
route53 (default, recommended) |
Your domain is hosted in Route 53 | route53_zone_id + webhook_domain |
ACM cert, DNS-validation CNAME, ALB ALIAS — fully automated, no manual DNS clicks |
external |
Your DNS is at Squarespace, Cloudflare, GoDaddy, etc. | acm_certificate_arn (you pre-issue + DNS-validate the cert) + webhook_domain |
Nothing DNS-related. You manually point your DNS at the alb_dns_name output after apply. |
aws route53 list-hosted-zones \
--query "HostedZones[?Name=='example.com.'].Id" \
--output text
# returns: /hostedzone/Z01234567ABCDEFGH (deploy.sh strips the prefix)- Request the cert in the same region you'll deploy to:
aws acm request-certificate \ --domain-name rtms.example.com \ --validation-method DNS \ --region us-east-1
- AWS returns a CNAME record to add at your DNS provider. Add it.
- Poll until
Certificate.Statusis"ISSUED". - Pass the cert ARN into
deploy.shwhen it prompts. - After
terraform apply, create a CNAME at your DNS provider pointing your subdomain at thealb_dns_nameoutput.
| Resource | Purpose |
|---|---|
| VPC (2 AZs) | Networking foundation. Default profile uses public subnets for cost; flip use_private_subnets = true for production. |
| ALB | Public HTTPS endpoint. Terminates TLS via your ACM cert. Health check: GET / matcher 200-299 (worker returns 200 to GET via SDK monkey-patch — see DEVS-X9 v1.2 work). |
| ACM cert + Route 53 records (route53 mode only) | Issued, DNS-validated, and ALIAS-linked to the ALB automatically. |
| ECS cluster + Fargate service | Runs the worker container. Min 1 task, max 20 by default. |
| Auto-scaling policy | Target tracking on CPU at 50%. |
| S3 bucket | Transcript JSONL at transcripts/<meeting_uuid>/<ns-epoch>.jsonl (one PUT per chunk). SSE-S3, versioned, 30-day lifecycle to Glacier IR. |
| CloudWatch log group | /aws/ecs/<project>-worker, 30-day retention. |
| CloudWatch alarms | ALB 5xx > 5/min, no healthy hosts, ECS service unhealthy. |
| CloudWatch dashboard | ALB requests + 5xx, ECS CPU + memory, ALB healthy host count. |
| AWS Budgets alert | 80% + 100% of monthly_budget_usd → email. |
| Secrets Manager references | Three pre-existing secrets injected into the task as env vars. |
After deploy.sh (or terraform apply) finishes:
| Output | Use |
|---|---|
webhook_url |
Paste into Zoom Marketplace → Event Subscriptions |
alb_dns_name |
Point your DNS record here (external mode only) |
transcript_bucket |
Where .jsonl transcripts land |
ecs_cluster_name / ecs_service_name |
For aws ecs debugging |
log_group_name |
CloudWatch Logs Insights queries |
| State | Approx. monthly |
|---|---|
| Idle (0 meetings) | |
| 10 concurrent × 1 hr/day | ~$22 |
| 100 concurrent × 1 hr/day | ~$30 |
A monthly_budget_usd tfvar wires up an AWS Budgets alert at 80%. Run bash scripts/teardown.sh when not testing — the ALB is the only resource that bills meaningfully when idle.
- Fresh AWS account in
us-east-1 max_capacity = 2,task_cpu = 256,task_memory = 512terraform destroyimmediately after each session- Realistic budget: ~$3 for a full end-to-end test pass
cd worker
python3 -m venv .venv && source .venv/bin/activate # or: uv venv && source .venv/bin/activate
pip install -r requirements-dev.txt # or: uv pip install -r requirements-dev.txt
pytest # 48 tests, ~1 secondThe same worker/main.py runs locally. Copy .env.example to .env and fill in your credentials. The file is annotated with # DEV: <value> comments next to the lines that should be flipped for local development — progressive SDK logs, DEBUG Python logs, on-disk transcripts, smaller pools, faster reconnect. Leave the production-shape values as-is when you want a CloudWatch-bound test run. .env is gitignored. In Fargate no .env file exists; secrets come from Secrets Manager.
.envfiles are local-only. In production (Fargate, EC2, Kubernetes, anywhere customer-facing), secrets must come from AWS Secrets Manager via the ECS task definition'ssecretsblock — never baked into a Docker image, never copied onto a server, never committed to git. The Terraform in this repo wires Secrets Manager → ECS automatically;deploy.shhandles populating Secrets Manager from your input (or detects existing entries by name and re-uses them).
Point an ngrok tunnel at port 8080 to receive Zoom webhooks locally.
# Webhook URL validation (CRC handshake — no real meeting needed)
bash scripts/send-test-webhook.sh validation
# expect: HTTP 200 with {"plainToken": ..., "encryptedToken": ...}
# HMAC signature rejection
bash scripts/send-test-webhook.sh invalid
# expect: HTTP 401; check the WebhookSignatureFailed CloudWatch metric
# Real RTMS smoke — start a Zoom meeting with RTMS on, then:
aws logs tail /aws/ecs/rtms-demo-worker --follow
aws s3 ls s3://<transcript_bucket>/transcripts/ --recursive
# Autoscale check — synth load via a small loop of webhooks
bash scripts/send-test-webhook.sh burst 20
aws ecs describe-services --cluster rtms-demo-cluster --services rtms-demo-worker \
--query 'services[0].desiredCount'
# Failover — stop a running task, watch ECS replace it within ~60s
aws ecs list-tasks --cluster rtms-demo-cluster
aws ecs stop-task --cluster rtms-demo-cluster --task <task-arn>bash scripts/teardown.shForce-empties the S3 bucket, runs terraform destroy, and verifies no orphaned ENIs remain.
This is a POC reference, not a production-hardened deployment. Key trade-offs you should know about:
Per Zoom's RTMS reconnection contract, the worker handles all three failover scenarios using the SDK's events (not raw WebSockets):
| Scenario | Trigger | Worker action |
|---|---|---|
| 1. RTMS server failure | meeting.rtms_started arrives with same rtms_stream_id but different server_urls |
Leave old client, join with new payload |
| 2. Signal connection down | meeting.rtms_interrupted webhook |
Leave old client, join with the webhook's fresh credentials |
| 3. Media connection down only | SDK on_media_connection_interrupted callback |
Re-issue client.join(stored_payload) with exponential backoff (3s/6s/12s/24s/30s) up to 5 attempts |
Reference: zoom/rtms-samples/rtms_api/reconnection_and_chaos_mode_js — same contract, raw-WebSocket implementation. Our worker does it through SDK callbacks.
- Task crash → ECS replaces the task within ~60s. In-flight meetings on the dead task are dropped (webhook is already ack'd, no replay). Future webhooks route to healthy tasks via the ALB.
- AZ outage → ECS places replacement tasks in the healthy AZ.
- Spot interruption → SIGTERM handler calls
client.leave()on all active meetings inside the 30s Fargate stop grace; future webhooks land elsewhere.
- Public-subnet workers. Saves ~$32/mo NAT GW cost. Security group locks ingress to the ALB. For private subnets + NAT, set
use_private_subnets = true. - Fargate Spot. ~70% cheaper, but tasks can be reclaimed with 2-minute notice. Mix Spot + on-demand via
spot_weight/ondemand_weightfor a stable baseline. - One S3 PUT per transcript chunk. Simple to reason about, but produces lots of small objects. Production fix: buffer ~30s in memory and PUT a multi-line JSONL, or stream to Kinesis Firehose.
- All meeting state lives in worker memory. This is the most important trade-off to understand. The worker keeps three in-process maps for each active stream — the
rtms.Clientinstances (clients), the cached webhook payloads used for reconnection (stream_payloads), and per-stream reconnect-attempt counters (reconnect_attempt). Plus the in-process webhook dedup table. All of these are lost on task restart. If a Fargate task crashes mid-meeting (Spot reclaim, OOM, AZ blip), the in-flight meeting drops — the next webhook for that stream lands on a fresh task with no record of the previous join, and Zoom's join signature is time-bound so re-joining isn't always possible. Production fix: SQS-replay architecture where webhook ingest is decoupled from the worker fleet (see spec.md → Future Work). Cross-task dedup would similarly need DynamoDB / ElastiCache — but the rtms SDK rejects duplicate joins anyway, so per-task dedup is belt-and-suspenders.
Set zoom_host = "https://zoomgov.com" in terraform.tfvars. The worker maps URLs accordingly. ALB / ECS / DNS are unchanged.
- Zoom webhook validation fails:
ZM_RTMS_WEBHOOK_SECRETin Secrets Manager must match the Marketplace app's Secret Token exactly. Tail logs:aws logs tail /aws/ecs/<project>-worker --follow. - ALB returns 502 / 503: target group health check failing. The ALB hits
GET /on port 8080; matcher is200-299(the worker subclassesWebhookHandlerto return 200 to GET — without that, the SDK returns 501 which can't match within ALB's allowed 200-499 range). - Worker can't pull image: task execution role missing ECR read, or the image doesn't exist at the URI in
worker_image. Check the ECR repo + image tag. - Transcripts not appearing:
on_join_confirmnot firing. Usually a Zoom-side scope or RTMS-enablement issue; check the Marketplace app config and confirmmeeting.rtms_started,meeting.rtms_stopped,meeting.rtms_interruptedare subscribed events. failed to open file [/app/logs/...] errno = 2in logs: rebuild the worker image — image tag0.1.3or later includesmkdir -p /app/logsin the Dockerfile. The C++ RTMS SDK writes internal log files to./logs/relative to CWD; the container'sWORKDIR /appresolves that to/app/logs/which must exist.terraform destroyhangs on VPC: runbash scripts/teardown.sh— it force-deletes Fargate ENIs.
rtms-terraform-aws/
├── deploy.sh # one-command guided setup
├── README.md # this file
├── docs/
│ └── MANUAL_SETUP.md # step-by-step walkthrough (what deploy.sh automates)
├── terraform.tfvars.example # all inputs documented
├── .env.example # worker env-var contract (prod-shape with DEV: overrides)
├── main.tf / variables.tf / outputs.tf / versions.tf
├── modules/ # network, alb, worker, storage, observability
├── worker/ # Python RTMS consumer (Dockerfile + main.py + tests)
├── scripts/
│ ├── bootstrap-state.sh # S3 backend setup
│ ├── build-and-push-worker.sh # build + push worker image to ECR
│ ├── send-test-webhook.sh # signed-webhook smoke harness
│ ├── teardown.sh # destroy + ENI cleanup
│ └── validate-terraform.sh # fmt + validate + tflint + checkov
└── .github/ # dependabot + CI workflows