fix: CI test run stability improvements by TravisStark · Pull Request #1832 · aws-observability/aws-otel-test-framework

TravisStark · 2026-05-13T13:29:20Z

Title:

fix: CI stability — prevent instance leaks, reduce
timeouts, add subnet capacity

Details

- Trap signals in executeTerraformTest.sh so cancelled runs always destroy resources
- Bail on failed destroy instead of retrying on broken state (root cause of 668-instance leak)
- Run terraform init before destroy in cleanup script to fix "Module not installed" failures
- Reduce apply timeout 45m→30m, retries 2→1 for faster failure
- Bump EKS kubernetes_service create timeout 20m→30m to accommodate Classic ELB tail latency
- Parallelize EC2 wait-patch calls (15m worst case, was 30m)
- Wire in 4th public subnet (us-west-2a-1, 251 IPs) for additional capacity
- Append cache misses to file for downstream visibility
- Pin VPC module version
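The first two items describe a control-flow rule: teardown has to run however the apply step ends, and a failed teardown has to stop the run rather than trigger another apply on top of broken state. A minimal Python sketch of that rule follows. The real logic lives in executeTerraformTest.sh; `run_testcase`, `install_cancel_handlers`, `DestroyFailed`, and the `apply`/`destroy` callables are hypothetical names used only for illustration, and the default attempt count is an assumption.

```python
import signal


class Cancelled(Exception):
    """Raised when the run is cancelled by SIGINT or SIGTERM."""


class DestroyFailed(Exception):
    """Raised when teardown fails and the run must stop."""


def install_cancel_handlers():
    """Turn SIGINT/SIGTERM into an exception so `finally` blocks still run."""
    def _cancel(signum, frame):
        raise Cancelled(f"received signal {signum}")

    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _cancel)


def run_testcase(apply, destroy, max_attempts=2):
    """Apply up to max_attempts times, destroying after every attempt.

    `apply` and `destroy` are hypothetical callables that return True on
    success. Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        succeeded = False
        try:
            succeeded = apply()
        finally:
            # Runs on success, failure, and cancellation alike.
            if not destroy():
                # Never retry apply on top of state that failed to destroy.
                raise DestroyFailed(f"destroy failed on attempt {attempt}")
        if succeeded:
            return attempt
    raise RuntimeError(f"apply failed after {max_attempts} attempts")
```

A caller would invoke `install_cancel_handlers()` once from the main thread before the first `run_testcase` call, so a cancelled workflow still reaches the `finally` block.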

- Switch kubernetes/kubectl/helm providers from static token to exec-based auth (aws eks get-token). Static tokens expire after 15m, causing Unauthorized errors on long-running applies.
- Switch LoadBalancer services to NLB (1-2 min provision vs 15-20 min for Classic ELBs under concurrency pressure).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Replace docker-compose v1.27.4 with built-in docker compose v2
- Add --quiet pull to suppress layer-by-layer noise in logs
- Add ::group:: annotations for collapsible log sections per testcase
- Add timestamped phase markers (init, apply, destroy)
- Write step summary table with pass/fail and duration per testcase

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The AL2 Docker package from amazon-linux-extras doesn't include the compose plugin. Install it as a CLI plugin from GitHub releases. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Batch generator now groups tests by platform (EC2/0, EKS/1, etc.) so batches never mix platforms. Easier to identify platform-specific failures in the sidebar.
- RetryHelper now logs the exception message on each retry attempt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
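The grouping rule can be sketched in a few lines of Python. This is an illustration only, not the framework's batch generator: `build_batches`, the `(platform, name)` input shape, and the per-platform batch index are assumptions.

```python
from collections import defaultdict


def build_batches(testcases, batch_size):
    """Group (platform, name) pairs into single-platform batches.

    Returns a dict mapping a batch key such as "EC2/0" to the list of
    testcase names in that batch.
    """
    by_platform = defaultdict(list)
    for platform, name in testcases:
        by_platform[platform].append(name)

    batches = {}
    for platform, names in by_platform.items():
        # Chunk each platform separately so no batch mixes platforms.
        for start in range(0, len(names), batch_size):
            index = start // batch_size
            batches[f"{platform}/{index}"] = names[start:start + batch_size]
    return batches
```

Because the platform is part of the batch key, a failing batch in the sidebar already tells the reader which platform to look at.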

…ment (all)

ALBs and EFS mount targets reject multiple subnets in the same AZ. Keep original 3-subnet list for those consumers, use expanded 4-subnet list only for random instance placement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
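The constraint amounts to "at most one subnet per availability zone" for ALB and EFS consumers. A small Python sketch of deriving such a list from the expanded one follows; the subnet IDs are made up and `one_subnet_per_az` is a hypothetical helper, since the PR itself simply keeps the original 3-subnet list in Terraform.

```python
def one_subnet_per_az(subnets):
    """Keep the first subnet seen in each availability zone.

    `subnets` is a list of (subnet_id, availability_zone) pairs.
    """
    seen_zones = set()
    result = []
    for subnet_id, zone in subnets:
        if zone not in seen_zones:
            seen_zones.add(zone)
            result.append(subnet_id)
    return result


# Hypothetical IDs: three original subnets plus a 4th in an AZ already used.
ALL_SUBNETS = [
    ("subnet-a", "us-west-2a"),
    ("subnet-b", "us-west-2b"),
    ("subnet-c", "us-west-2c"),
    ("subnet-a-1", "us-west-2a"),
]

placement_subnets = [subnet_id for subnet_id, _ in ALL_SUBNETS]  # all four
lb_subnets = one_subnet_per_az(ALL_SUBNETS)  # one per AZ
```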

The validator runs in a Docker container without the AWS CLI, so exec-based auth in the kubeconfig doesn't work. Generate a fresh token at apply time via data.external and embed it. The validator only runs for ~2 min so the 15-min token TTL is sufficient. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Under high concurrency (110 batch jobs), X-Ray ingestion can take longer than the 100s window (5 retries × 20s). Doubling to 10 retries gives 200s which should cover Kafka pipeline latency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Sidebar now shows EC2/amazonlinux2/0, EC2/windows2022/0, EKS/us-west-2|cluster-name/0, ECS/EC2/0, etc. Makes it trivial to identify AMI-specific or cluster-specific failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove yum update -y (causes RPMTransaction fatal errors under load)
- Add 3-retry loop on amazon-linux-extras install docker
- Use systemctl start docker (more reliable than service docker start)
- Reduce sleep 30 to sleep 10 (docker starts in <5s)
- Bump X-Ray trace validation retries from 10 to 15 (300s window) for Kafka pipeline tests under high concurrency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Keeps system packages current while preventing RPM transaction failures from aborting the provisioner. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The spark sample app emits totalApiBytesSent reliably but apiBytesSent only appears after a full EMF aggregation window. Under high concurrency this window often doesn't complete before validation retries exhaust. Replaced with totalApiBytesSent which is always present. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…etries

- Remove /var/run/docker.sock and /var/lib/docker mounts from the container insights daemonset. These don't exist on AL2023/containerd 2.1 (EKS 1.33+) and cause pod scheduling warnings.
- Bump default MAX_RETRIES from 10 to 15 (300s window) to handle ECS task startup latency under high concurrency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- MAX_RETRIES 15 -> 20 (400s default window for mocked server/CW validation)
- X-Ray trace inner retry 15 -> 20 (400s for Kafka pipeline ingestion)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
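The windows quoted across these retry bumps all follow from retries × interval, using the 20s interval given for the original 5-retry X-Ray window. A quick Python check of the arithmetic (`retry_window` is a hypothetical helper, and applying the same 20s interval to MAX_RETRIES is inferred from the quoted 300s and 400s figures):

```python
RETRY_INTERVAL_SECONDS = 20  # from "5 retries × 20s" in the commit message


def retry_window(retries, interval=RETRY_INTERVAL_SECONDS):
    """Total time covered by `retries` attempts spaced `interval` apart."""
    return retries * interval


for retries in (5, 10, 15, 20):
    print(f"{retries:>2} retries -> {retry_window(retries)}s")
# prints:
#  5 retries -> 100s
# 10 retries -> 200s
# 15 retries -> 300s
# 20 retries -> 400s
```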

Rely on 20-retry bump (400s) to find it instead of removing validation coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Increase haproxy/nginx helm release timeout from 300s to 600s to fix containerinsight_eks_prometheus failures on busy clusters
- Remove protocol_version from kafka test configs; MSK clusters report "3.9.x", which is unparseable by the collector's kafka client, so let the client auto-negotiate instead
- Strip region and collector-ci- prefix from batch job display names for readability (EKS/amd64-1-33 instead of EKS/us-west-2|collector-ci-amd64-1-33)
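The display-name change in the last item is a plain string transformation. A Python sketch under the assumption that names are `/`-separated and the region is joined to the cluster name with `|`, as in the example; `short_display_name` is a hypothetical name, not the generator's actual function.

```python
def short_display_name(name, prefix="collector-ci-"):
    """Drop the "<region>|" part and the cluster prefix from each segment."""
    segments = []
    for segment in name.split("/"):
        if "|" in segment:
            # Keep only what follows "<region>|".
            segment = segment.split("|", 1)[1]
        if segment.startswith(prefix):
            segment = segment[len(prefix):]
        segments.append(segment)
    return "/".join(segments)


print(short_display_name("EKS/us-west-2|collector-ci-amd64-1-33"))
# prints: EKS/amd64-1-33
```

Names without a region or prefix, such as EC2/amazonlinux2/0, pass through unchanged.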

Docker 25.0.14-1.amzn2.0.5 breaks port forwarding on Amazon Linux 2, causing the validator to be unable to reach the sample app on port 8080 despite SSH (port 22) working to the same host. Pin to the last known working version.

Replace all 0.0.0.0/0 ingress rules in the shared security group with 10.0.0.0/16 (VPC-internal only). Add an ephemeral per-run security group in the EC2 module that discovers the GH Actions runner's public IP via checkip.amazonaws.com and opens only the required ports (22, 80, 8080, 5985) from that single IP for the duration of the test. This addresses Sev-2 ticket D452320721 where an external scanner exploited the open WinRM port. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
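As a rough illustration of the per-run rule set, the Python sketch below turns the body returned by the IP-discovery endpoint into one single-address ingress rule per port. The dictionary shape and the `runner_ingress_rules` name are assumptions; the PR implements this in Terraform inside the EC2 module.

```python
import ipaddress

REQUIRED_PORTS = (22, 80, 8080, 5985)  # ports listed in the commit message


def runner_ingress_rules(checkip_response, ports=REQUIRED_PORTS):
    """Build one ingress rule per port, scoped to a single runner IP.

    `checkip_response` is the raw response body: the address, possibly
    followed by a newline.
    """
    # Raises ValueError if the response is not a valid IP address.
    ip = ipaddress.ip_address(checkip_response.strip())
    cidr = f"{ip}/{ip.max_prefixlen}"  # /32 for an IPv4 address
    return [
        {"protocol": "tcp", "from_port": port, "to_port": port, "cidr": cidr}
        for port in ports
    ]
```

Validating the response before building the CIDR keeps a malformed or empty reply from silently producing an over-broad rule.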

ECS tests use public-facing ALBs that share this SG. HTTP ports are non-sensitive (no auth, ephemeral test data). Only SSH/WinRM/RDP remain restricted to VPC CIDR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

checkCacheHits was running sequentially taking ~10min. With 20-way parallelism it should complete in under 30s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
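The change from sequential to 20-way parallel checks can be sketched with a thread pool. This is a Python illustration of the pattern only: `find_cache_misses` and the `is_cache_hit` callable are hypothetical stand-ins for whatever checkCacheHits does per testcase.

```python
from concurrent.futures import ThreadPoolExecutor


def find_cache_misses(keys, is_cache_hit, max_workers=20):
    """Check every key concurrently and return the misses in input order.

    `keys` is a list of testcase identifiers. `is_cache_hit` is a
    hypothetical callable that returns True when a passing result is
    already cached for that key.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with keys.
        hits = list(pool.map(is_cache_hit, keys))
    return [key for key, hit in zip(keys, hits) if not hit]
```

Threads suit this workload because each check spends its time waiting on a network call rather than on the CPU.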

Sends shell/PowerShell commands to instances via SSM RunCommand, waits for completion, and streams output. Auto-detects platform. Building block for VPC migration (replacing remote-exec provisioners). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

All file provisioners now upload to S3 and pull via SSM. All remote-exec provisioners now use aotutil ssm run-command. No more direct SSH/WinRM connections from the runner to instances.

Converted resources:
- setup_mocked_server_cert_for_windows (file+remote-exec -> S3+SSM)
- setup_mocked_server_cert_for_linux (file+remote-exec -> S3+SSM)
- download_collector_from_local (file -> S3+SSM)
- download_collector_from_s3 (remote-exec -> SSM)
- collector_file_configuration (file -> S3+SSM)
- start_collector (remote-exec -> SSM)
- install_collector_from_ssm (remote-exec wait -> SSM)
- setup_sample_app_and_mock_server (file+remote-exec -> S3+SSM)
- install_cwagent (file+remote-exec -> S3+SSM)
- ssm_validation (remote-exec -> SSM)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Since all provisioning now uses SSM, the per-run SG only needs HTTP ports (80, 8080) for the validator to reach the sidecar. Removed login_user, connection_type locals and get_password_data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The per-run SG was hitting the account's SG quota (3000, mostly consumed by 2383 orphaned k8s-elb-* SGs). Since provisioning now uses SSM (no SSH/WinRM), and HTTP ports 80/8080 are already open on the shared SG for ALBs, the per-run SG is unnecessary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… region fixes)

The SSM approach hit two issues in CI:
1. aotutil binary not found (relative path doesn't exist on runner)
2. S3 bucket aws-otel-collector-test is in us-east-1, not us-west-2

Reverting to per-run SG + remote-exec which now works after cleaning 2000+ orphaned k8s-elb security groups.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…sioners

The aws-otel-collector-test bucket is in us-east-1 but the default provider is us-west-2. Added provider alias 'aws.s3' pointing to us-east-1 for all aws_s3_object resources. Also re-enables the SSM provisioner migration (reverts the revert).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove public IPs from both AOC and sidecar instances
- Remove WinRM config from Windows user_data (no longer needed)
- Move validator to run on the sidecar via SSM (localhost endpoints)
- Upload validator source to S3, build on sidecar, run via docker compose
- Switch sample_app_listen_address_host to private_ip

Zero public ingress surface. All instances in private subnets with NAT gateway for outbound only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three fixes for SSM provisioner failures:
1. PATH: Prepend /usr/local/bin to PATH for Linux SSM commands (fixes 'aws: command not found' on Ubuntu/RedHat)
2. --stdin flag: Read commands from stdin (one per line) to avoid shell quoting issues when commands contain $(), &, or nested quotes
3. start_collector + ssm_validation: Use heredoc + --stdin to pass commands without local shell interpretation (fixes presigned URL & escaping and Windows PowerShell syntax errors)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
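The reason stdin avoids the quoting problem is that the command text never passes through a local shell. The Python sketch below demonstrates the general technique with a stand-in receiver that echoes its stdin. It does not invoke aotutil, and the sample commands and URL are invented for illustration.

```python
import subprocess
import sys

# Commands containing $(), & and nested quotes, as in the commit message.
COMMANDS = [
    'echo "started at $(date)"',
    "curl -o app.zip 'https://example.com/app.zip?X-Sig=abc&X-Exp=60'",
]


def send_via_stdin(argv, commands):
    """Run argv without a shell, feeding one command per line on stdin."""
    result = subprocess.run(
        argv,
        input="\n".join(commands) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Stand-in receiver that echoes stdin; NOT the real aotutil invocation.
ECHO_STDIN = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]

received = send_via_stdin(ECHO_STDIN, COMMANDS).splitlines()
```

The receiver sees each line exactly as written, so `$(date)` and `&` stay literal until the remote side interprets them.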

- Remove public IPs from both AOC and sidecar instances
- Remove WinRM config from Windows user_data
- Build validator image on runner (fast), upload to S3 as tarball
- Sidecar pulls and loads pre-built image (docker load, no build needed)
- Validator runs with network_mode:host using localhost endpoints

Zero public ingress surface. Runner build (~60s) replaces sidecar build (~10min).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Public subnets + no public IP = no internet. Instead, place instances in private subnets which route through the existing NAT gateway.

- basic_components: random_subnet uses private subnets now
- EC2: associate_public_ip_address = false
- ECS: EFS EC2 + launch template = false
- ALBs still use public subnets (correct for internet-facing LBs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Re-enable public IPs and SSH from anywhere to match Jeffrey's working run on test/fix-ami. The private subnet + SSM-only approach needs more work (runners keep dying). For now, use the pattern that works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Stock AL2 AMI had Docker daemon instability during validator builds. CWAgent AMIs are pre-configured with SSM agent and stable Docker. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. SUSE 15: wait for RPM lock release before installing collector
2. ssm-install-aoc.sh: exponential backoff with jitter on API calls
3. NodeDiskIO.json: allow additional properties from containerd 2.x
4. Windows: remove -Verbose from Import-Certificate (eliminates 40K lines of cert PEM dump in logs, saves 3-5 min per test)
5. Windows: use Start-Process msiexec -Wait with exit code check (msiexec returns 0 even on failure in certain SSM contexts)
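For readers unfamiliar with item 2, here is a minimal Python sketch of exponential backoff with jitter. The commit does not say which jitter variant ssm-install-aoc.sh uses, so the "full jitter" form, the default `base` and `cap` values, and both function names are assumptions.

```python
import random
import time


def backoff_delays(attempts, base=1.0, cap=30.0, rng=random):
    """Return one "full jitter" delay per attempt.

    Attempt n draws uniformly from [0, min(cap, base * 2**n)].
    """
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


def call_with_backoff(call, attempts=5, sleep=time.sleep):
    """Call `call` up to `attempts` times, sleeping between failures."""
    for delay in backoff_delays(attempts - 1):
        try:
            return call()
        except Exception:
            sleep(delay)
    # Final attempt: let any exception propagate to the caller.
    return call()
```

Randomizing the delay spreads retries from many concurrent jobs over time, which matters when a large number of batch jobs hit the same API at once.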

…r-compose, quiet+readable logs

1. SUSE 15: wait for RPM lock release before installing collector
2. ssm-install-aoc.sh: exponential backoff with jitter on API calls
3. NodeDiskIO.json: allow additional properties from containerd 2.x
4. Windows: remove -Verbose from Import-Certificate
5. Windows: use Start-Process msiexec -Wait with exit code check
6. Docker-compose: download from S3 instead of GitHub
7. Mark cert output as sensitive (suppresses plan noise)
8. Quiet terraform: init silenced, destroy silenced, -compact-warnings
9. Clear test progress banners with test number and pass/fail status

1. Try S3 first, fall back to GitHub if not in S3 yet
2. Replace 'aws ecr get-login' (v1) with 'aws ecr get-login-password' (v2)

docker load registers the image under its tag, so docker compose should reference that tag rather than try to build from a nonexistent path.

TravisStark requested a review from a team as a code owner May 13, 2026 13:29

fix: CI test run stability improvements

b8eda22

TravisStark force-pushed the fix/vpc-tf-version branch from 6b9492d to b8eda22 on May 13, 2026 13:51

TravisStark and others added 4 commits May 13, 2026 11:48

add logging

ebdd638

fix: install docker compose v2 plugin on EC2 sidecar

5616328


alexperez52 approved these changes May 13, 2026


TravisStark and others added 22 commits May 13, 2026 13:45

fix: restore yum update with --skip-broken and || true

6c726d2


fix: bump retries to 20 for Kafka/ECS reliability

7c498ef


revert: restore apiBytesSent in expected metric templates

9a94c53


perf: parallelize DynamoDB cache-hit checks (P1 -> P20)

0a8a57d


TravisStark and others added 4 commits May 14, 2026 14:36

TravisStark force-pushed the fix/vpc-tf-version branch from 729b344 to a43fa84 on May 14, 2026 23:52

jefchien and others added 10 commits May 15, 2026 06:41

Switch to more cloudwatch agent AMIs

1fa7640

Allow public IP address

7ad6e91

Use shared AMIs from CloudWatch agent

8716b53

fix: use CWAgent shared AMI for sidecar (pre-installed SSM + Docker)

149ea84


Disable patching delays

f028be8

fix: S3 docker-compose with GitHub fallback, fix ECR login for CLI v2

b35bd68


fix: use image tag instead of build path in validator compose

c8d380a



fix: CI test run stability improvements #1832

TravisStark wants to merge 41 commits into release/v0.48.x from fix/vpc-tf-version


3 participants
